Abstract —We study student behavior and performance in two 
Massive Open Online Courses (MOOCs). In doing so, we present 
two frameworks by which video-watching clickstreams can be 
represented: one based on the sequence of events created, and 
another on the sequence of positions visited. With the event- 
based framework, we extract recurring subsequences of student 
behavior, which contain fundamental characteristics such as 
reflecting (i.e., repeatedly playing and pausing) and revising (i.e., 
plays and skip backs). We find that some of these behaviors are 
significantly associated with whether a user will be Correct on 
First Attempt (CFA) or not in answering quiz questions, and in 
ways that are not necessarily intuitive. Then, with the position- 
based framework, we devise models of performance based naturally 
on user behavior. In evaluating these models through CFA 
prediction, we find that three of them can substantially improve 
prediction quality in terms of accuracy and F1, which underlines 
the ability to relate behavior to performance. Since our prediction 
considers videos individually, these benefits also suggest that our 
models are useful in situations where there is limited training 
data, e.g., for early detection or in short courses. 

Index Terms —Clickstream Data, Data Mining, Performance 
Prediction, MOOC, Learning Analytics 

I. Introduction 

OVER the past decade, technology advances have been 
influencing the ways we can learn. One of the prominent 
innovations has been the Massive Open Online Course 
(MOOC). MOOC providers such as Coursera, edX, and 
Udacity have offered courses reaching out to tens, and even 
hundreds of thousands of students within single sessions 0. 

One salient feature of MOOCs is their high attrition rates, 
with typically less than 10% of students initially enrolled in 
a course seeing it to completion. These low completion rates, 
attributed to factors such as small teacher-to-student ratios, the 
asynchronous nature of interaction, and diverse demographics, 
have made MOOCs the subject of controversy as the future of 
higher education is explored 0. This has in turn ignited a 
growing body of research interest in understanding why these 
dropoff rates occur 0, 0, and in designing mechanisms to 
improve the quality of learning on MOOCs, such as through 
early detection of students with low performance 0 or 
high dropoff likelihoods [6], [7], through recommendations 
for discussion participation [3] or for certain allocations of 
peer grading [8], and through automated individualization [2]. 

A standard MOOC will contain three different learning 
modes for students: video lectures, assessments (e.g., in-video 
quizzes, homework assignments, and exams), and social 
networking (usually through discussion forums) [5]. Most 
platforms track student interaction with these different forms 
of learning, with backends designed to collect measurements 
as a student navigates through the course. For video content, 
these measurements include clickstream events, with a click 
event being generated and stored each time a learner interacts 
with a video, specifying the particular action (e.g., pause, 
skip), position, and time at which it occurred. For assessments, 
the specific responses to individual questions are tracked. For 
the discussion forums, the sequence of posts and comments 
are stored. This type of big data has been the focus of a 
number of recent studies in machine learning and data mining 
on understanding how MOOC users learn 0, [7], [9]. 
Motivation and objectives. What remains understudied, however, 
is the relationship between these learning modes. In 
particular, is it possible to associate a student’s behavior 
with his/her performance in a MOOC? Developing such an 
understanding would have implications not only for theories 
about how humans process information, but also to systems 
for improving low completion rates. For example, systems for 
individualized content delivery have largely been driven by 
algorithms that model users solely based on their assessment 
performance. This tends to be a sparse source of information 
about users in MOOCs, since many users complete few 
assessments [5]. Uncovering relationships between behavior 
and performance would allow individualization algorithms to 
be augmented with behavioral signals to determine the most 
suitable path of learning for each student to take, as suggested 
in 0. These relationships could also be provided to course 
instructors directly, in the form of extended learning analytics 
[10], to give instructors insight into which parts of their content 
contribute to more effective learning outcomes in their courses. 

Our work is motivated by this fundamental question of if, 
and how, it is possible to relate behavior to performance. In 
our investigation, we focus on the video-watching behavior 
of MOOC students, where users spend the majority of their 
time learning 0. These videos are typically equipped with 
quiz questions, which serve as immediate feedback of the 
knowledge a student gained from the content in the video. 
In relating behavior to performance, then, we can consider (i) 
the clickstreams generated by a user in watching the video 
associated with a particular quiz (i.e., the behavioral aspect), 
and (ii) whether the user was Correct on First Attempt (CFA) 
or not (non-CFA) in answering the given quiz question (i.e., 
the performance aspect). 

In our investigation, we formalize different ways that video-watching 
clickstreams can be represented as sequences, and 

| Dataset | Lectures | Lecture Videos | Video Length (min), avg. (s.d.) | Quizzes | Users | Clickstream Events | User-Video Pairs | CFA Score, avg. (s.d.) |
| ‘FMB’ | 20 | 92 | 16.9 (5.96) | 92 | 3770 | 314,632 | 26,250 | 0.663 (0.473) |
| ‘NI’ | 6 | 115 | 5.44 (2.17) | 69 | 2680 | 416,214 | 36,464 | 0.750 (0.433) |

Fig. 1: Basic information on the two datasets. The values in the right columns are the final numbers after data filtering. 


apply the frameworks we develop to meet two objectives: 
Objective 1 (O1): To identify recurring behaviors of learners, 
such as revising content or skipping forward repeatedly. 
Objective 2 (O2): To assess the impact of behavior on performance, 
i.e., how patterns identified in O1 and specific positions 
visited in each video are signals of effective learning. 

Previous work on studying the video-watching clickstreams 
of students [7] has focused on the sequence of events (e.g., 
pause, skip forward) generated. In studying O1 and O2, we identify 
two additional factors that are important to capture: the 
positions in the video that a student visited, and the 
duration/length of time between the events and positions. These form 
the basis of our clickstream representation frameworks. 

In our investigation, we employ two datasets coming from 
two different MOOCs we have instructed on Coursera. After 
filtering (described in Sec. II), these datasets contain 315K 
and 416K clickstream events corresponding to 26K and 36K 
first-attempt quiz submissions by students. With these datasets, 
our study is specifically broken down into two components: 
behavioral motifs and behavior-based prediction, as follows. 
Behavioral motifs. We first develop an event-based framework 
to represent clickstreams (Sec. II), which captures event types 
and their lengths. Leveraging this framework, we are able to 
identify video-watching motifs, i.e., sub-sequences of student 
behavior that occur significantly often, in our two datasets. 
These motifs by themselves are informative of recurring 
behaviors for O1 (Sec. III), and additionally, we are able 
to identify a significant difference in the presence of certain 
motifs between the CFA and non-CFA sequences for O2. For 
example, we find that a series of behaviors are indicative of 
students reflecting on material, and are significantly associated 
with the CFA sequences in one of our courses. As another 
example, we identify motifs that are consistent with rapid- 
paced skimming through the material, and reveal that these are 
discriminatory in favor of non-CFA in both of our courses. 

For these motifs, the identified association with CFA or 
non-CFA (when one exists) is particularly helpful, because for 
many of them, either case is conceivable. For one, skimming 
could intuitively be a sign of a student either correctly or 
incorrectly perceiving familiarity with the material; our results 
indicate the latter is more likely. Also, we find that incorporating 
the lengths in addition to the events is important to these 
findings, because extracting motifs from sequences of events 
alone does not reveal these insights. 

Behavior-based prediction. In investigating O2, we will also 
develop models for knowledge gained based on the clicks that 
a student makes in a video. The quality of such a model can 
be evaluated by considering its ability to generalize to incoming 
samples through prediction. The higher the quality, the 
stronger the association between behavior and performance. 

To this end, we will also study student performance prediction 
(specifically, CFA prediction) for MOOCs. Enhancing CFA 
prediction is an important area of research in its own right, 
because such methods can improve systems for early detection 
of e.g., struggling/advanced students and easy/difficult material 
0. In seeking appropriate models for student performance, 
we find that while some behavioral patterns of the motifs are 
significantly associated with performance, their supports and 
the resulting success estimates are not sufficient to make large 
improvements in CFA prediction. As a result, we propose 
a second behavioral representation, which is based on the 
sequence of positions visited in a video (Sec. IV). In contrast to 
training over a long course duration as in [5], [11], we consider 
CFA prediction on a per-video basis, in order to quantify the 
benefit obtained by the positions in each individual video. 

We evaluate three different models based on our framework 
(Sec. V), and find that they obtain substantial improvements 
in prediction when compared to a baseline that does not 
use click information. This underscores the ability to relate 
clicks to knowledge gained, i.e., that behavior is related to 
performance, and shows that behavioral information is useful 
in situations where multiple videos are not available, e.g., in 
short courses or for detection early in a course. Further, since 
our algorithms are natural representations of student behavior 
(e.g., sequences of positions visited), they can be used to guide 
student actions while watching a video in real time. 
Summary of contribution. Compared with other work (Sec. 
VI), we make three main contributions in this paper: 

1) We develop two new frameworks for representing student 
video-watching behavior as sequences. 

2) We extract the recurring behavioral motifs of students 
watching videos using motif identification schemes, and 
associate these fundamental patterns with performance. 

3) We demonstrate that video-watching behavior can be 
used to enhance student performance prediction on a 
per-video basis, e.g., for earliest detection. 

II. Datasets and Clickstreams 

In this section, we describe our datasets, and present our 
first sequence specification based on events and lengths. 

A. Our Two MOOCs 

Our datasets come from two different courses that we have 
instructed on Coursera: Networks: Friends, Money, and Bytes 
(‘FMB’) and Networks Illustrated: Principles Without Calculus 
(‘NI’).¹ Each of these courses teaches networking topics, 
but ‘FMB’ delves into the mathematical specifics behind the 
topics, whereas ‘NI’ is meant as an introduction to the subject 
(see 0 for more details). We obtained two types of data 
from Coursera for each of the courses: (i) video-watching 

1 www.coursera.org/course/{friendsmoneybytes,ni} 

[Figure: timeline of four clicks on a video, labeled $E_1 = \langle \text{play}, p_1, t_1, \text{playing}, r \rangle$, $E_2 = \langle \text{skip}, p_2, t_2, \text{playing}, r \rangle$, $E_3 = \langle \text{pause}, p_3, t_3, \text{paused}, r \rangle$, and $E_4 = \langle \text{skip}, p_4, t_4, \text{paused}, r \rangle$.]

Fig. 2: Illustration of a sequence of clicks $E_1$ to $E_4$ on a video, where 
the horizontal axis denotes the video length. This will generate 5 
events according to our first framework based on events and lengths. 
The lengths $l_j$ for the events that have this property (note that pauses 
do not have lengths) are depicted above the diagram. 

clickstreams, which log user interaction with the video player, 
and (ii) information on the in-video quiz submissions. We 
will describe the format of the video-watching clickstreams in 
detail in Sec. II-B, in developing a representation framework. 
Course format. The course formats are summarized in Fig. 
1. Each is made up of a series of lectures, which are in turn 
comprised of a set of videos. ‘FMB’ is a longer course, with 
20 lectures, whereas ‘NI’ only has 6. ‘NI’ had more, shorter-length 
videos, with a total of 115 videos and an average (avg.) 
length of 5.4 min per video, whereas ‘FMB’ has fewer, longer-length 
videos, with 93 total and an avg. length of 16.9 min. 

For each course, we included in-video quizzes at the end 
of the videos, to test a student’s understanding of the material 
throughout the course. Each quiz is a multiple choice question, 
in radio-response format, with 4-5 possible answer choices. 
For ‘FMB’, there was one question at the end of each video, 
whereas for ‘NI’, each of the 69 questions was associated 
with anywhere from 1-4 videos. In mapping videos to quizzes, 
we will refer to “video X” as the contiguous set of videos 
occurring after question $X - 1$ and before question $X$. 

User-Video Pairs. We extract User-Video (UV) Pairs from the 
data, with two sets of information for video and quiz X: 

(i) Video-watching trajectory. The set of clickstream logs 
(events) for the user in video X. 

(ii) CFA result : Whether the user was Correct on First Attempt 
(CFA) or not (non-CFA) for quiz X. 

In total, there were 122.5K UV Pairs for ‘FMB’, with 566K 
click events. For ‘NI’, these numbers were 149K and 882K, 
respectively. After removing any UV Pair that had at least one 
null, stall or error contained in its video-watching trajectory, 
we were left with the numbers given in Fig. 1. The avg. CFA 
score across the UV Pairs was 0.663 for ‘FMB’ (standard 
deviation (s.d.) = 0.47), and 0.750 for ‘NI’ (s.d. = 0.43). 

B. Processing Clickstream Events 

1) Our nomenclature for events: A clickstream log is one 
of four types: play, pause, ratechange, or skip. Each 
time one of these events is fired, a data entry is recorded that 
specifies the user and video IDs, event type, playback position, 
playback speed, and UNIX timestamp for the event. 

Formally, let $E_i$ denote the $i$th click event that occurs while 
a user is watching a video. We write $E_i = (e_i, p_i, t_i, s_i, r_i)$, 
where $e_i$ is the type of the $i$th click, $p_i$ is the video position 
(in sec) right after $E_i$ is fired, $t_i$ is the UNIX time (in sec) 
at which $E_i$ was fired, $s_i$ is the state of the video player 
(either playing or paused) as a result of $E_i$, and $r_i$ is the 
playback rate (i.e., speed) of the video player resulting from 
this event. The logs are sequenced chronologically for a UV 
Pair, i.e., $t_1 < t_2 < \cdots$. Based on the $E_i$ for a UV Pair, we 
define the following events: 

Play (Pl): A play event begins at the time when a click event 
$E_i$ is made for which the state $s_i$ is playing, and lasts until 
the next click $E_{i+1}$. It occurs for a duration $d = t_{i+1} - t_i$ and 
has a length $l = p_{i+1} - p_i$. 

Pause (Pa): A pause event is defined in the same way as a 
play event, except that the state $s_i$ is paused, and it 
does not have any length by definition. 

Skip back (Sb): A skip back (i.e., rewind) event occurs when 
the type $e_i$ = skip and $p'_i > p_i$, where $p'_i$ is the position of the 
video player immediately before the skip. If $s_{i-1}$ = playing, 
then $p'_i = p_{i-1} + (t_i - t_{i-1}) \cdot r_{i-1}$; if $s_{i-1}$ = paused, then 
$p'_i = p_{i-1}$. The length of the skip is $l = |p_i - p'_i|$, and there 
is no associated duration. 

Skip forward (Sf): A skip forward (i.e., fast forward) event 
is defined as Sb, except it captures the case where $p_i > p'_i$. 

Ratechange fast (Rf): This occurs when $e_i$ = ratechange 
and the rate $r_i > 1.0$.² There is no duration or length. 

Ratechange slow (Rs): This occurs when $e_i$ = ratechange 
and $r_i < 1$, again with no duration or length. 

Ratechange default (Rd): This is when $e_i$ = ratechange 
and $r_i = 1$, i.e., returning to the default. 
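The event definitions above can be sketched in code as follows. This is a minimal illustration, not the authors' implementation: the names `Click`, `Event`, and `derive_events` are hypothetical, the play length is computed as duration times playback rate (which matches the worked example for Fig. 2 rather than a difference of logged positions), a skip that is the very first click is ignored because its pre-skip position is undefined, and a skip with zero displacement is arbitrarily labeled Sf.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Click:
    etype: str   # e_i: 'play', 'pause', 'skip', or 'ratechange'
    pos: float   # p_i: video position (sec) right after the click
    time: float  # t_i: UNIX time (sec) of the click
    state: str   # s_i: 'playing' or 'paused' after the click
    rate: float  # r_i: playback rate after the click


@dataclass
class Event:
    kind: str                         # 'Pl', 'Pa', 'Sb', 'Sf', 'Rf', 'Rs', 'Rd'
    duration: Optional[float] = None  # d, for Pl and Pa
    length: Optional[float] = None    # l, for Pl, Sb and Sf


def derive_events(clicks: List[Click]) -> List[Event]:
    """Turn a chronologically ordered click log into a list of events."""
    events: List[Event] = []
    for i, c in enumerate(clicks):
        prev = clicks[i - 1] if i > 0 else None
        if c.etype == "skip" and prev is not None:
            # p'_i: where the player was immediately before the skip.
            if prev.state == "playing":
                before = prev.pos + (c.time - prev.time) * prev.rate
            else:
                before = prev.pos
            kind = "Sb" if before > c.pos else "Sf"
            events.append(Event(kind, length=abs(c.pos - before)))
        elif c.etype == "ratechange":
            kind = "Rf" if c.rate > 1.0 else "Rs" if c.rate < 1.0 else "Rd"
            events.append(Event(kind))
        # Insert the Pl or Pa interval that lasts until the next click.
        if i + 1 < len(clicks):
            d = clicks[i + 1].time - c.time
            if c.state == "playing":
                # Assumption: length = duration * rate, as in the Fig. 2 example.
                events.append(Event("Pl", duration=d, length=d * c.rate))
            else:
                events.append(Event("Pa", duration=d))
    return events
```

Applied to a four-click log shaped like the one in Fig. 2 (play, skip while playing, pause, skip while paused), this sketch yields the five events Pl, Sf, Pl, Pa, Sb.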

With these, the sequence of events for a UV Pair becomes 
$\hat{e}_1, \hat{e}_2, \ldots$ for $\hat{e}_j \in \mathcal{E}$ = {Pl, Pa, Sb, ...}, $|\mathcal{E}| = 8$. Each $\hat{e}_j$ may 
have an associated duration $d_j$ and/or length $l_j$. Fig. 2 shows 
a schematic to illustrate this; the clickstream logs here would 
generate: Pl, with $l_1 = (t_2 - t_1) \cdot r$ and $d_1 = t_2 - t_1$; Sf, with 
$l_2 = p_2 - p'_2$; Pl, with $l_3 = p_3 - p_2$ and $d_3 = t_3 - t_2$; Pa, with 
$d_4 = t_4 - t_3$; Sb, with $l_5 = p'_4 - p_4$. Note that we are inserting 
Pl and Pa events in-between other events, to incorporate the 
state of the video player during those times. This critical 
information is not captured through only the events in the raw 
data, and has been neglected in other work (e.g., in 
Denoising clickstreams. It is important to remove noise in 
the video-watching trajectories associated with unintentional 
user behavior. We handle two cases of events separately: 
Combining events: We combine repeated, sequential events 
that occur within a short duration (5 sec) of one another, since 
this pattern indicates that the user was adjusting to a final state. 
This is a common occurrence with forward (Sf) and backward 
(Sb) skips, where a user repeats the same action numerous 
times in a few seconds in seeking the final position; this should 
be treated as a single skip to the final location. Similarly, 
a series of Rf or Rs events may occur in close proximity, 
indicating that the user was in the process of adjusting the rate 
to the final value. Formally, if there is a sequence of clicks 
$E_i, E_{i+1}, \ldots, E_{i+K}$ for which $e_i = e_{i+1} = \cdots = e_{i+K}$ and 
$t_{i+k+1} - t_{i+k} < 5$ $\forall k \in \{0, \ldots, K-1\}$, then we use 
$E_i = (e_i, p_{i+K}, t_i, s_{i+K}, r_{i+K})$ in place of $E_i, E_{i+1}, \ldots, E_{i+K}$. 
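A minimal sketch of this combining rule is given below. The function name `combine_repeats` and the tuple layout are illustrative assumptions; the rule itself follows the formal statement above: a run of same-type clicks whose consecutive gaps are all under 5 sec collapses into one click that keeps the first timestamp and the final position, state, and rate.

```python
from typing import List, Tuple

# (e, p, t, s, r): click type, position, UNIX time, player state, playback rate
Click = Tuple[str, float, float, str, float]


def combine_repeats(clicks: List[Click], window: float = 5.0) -> List[Click]:
    """Collapse runs of same-type clicks separated by less than `window` sec."""
    merged: List[Click] = []
    last_time = 0.0  # timestamp of the most recent raw click
    for e, p, t, s, r in clicks:
        if merged and merged[-1][0] == e and t - last_time < window:
            # Same run: keep the run's first timestamp, take the latest
            # position, state and rate.
            first_t = merged[-1][2]
            merged[-1] = (e, p, first_t, s, r)
        else:
            merged.append((e, p, t, s, r))
        last_time = t
    return merged
```

Note that the gap is measured between consecutive raw clicks, not from the start of the run, so a long burst of rapid skips is still merged into a single skip to the final location.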
Discounting intervals: We identify two instances in which play 
(Pl) and pause (Pa) events should not be inserted between 
$E_i$ and $E_{i+1}$. First is if $E_i$ and $E_{i+1}$ occur on two different 
videos; here, there is no continuity as the user must have exited 

2 On Coursera, the default player speed is 1.0, and users can vary this 
between 0.5 and 2.0, in increments of 0.25. 

[Figure 3(a): boxplots of the distributions of d (Pl and Pa) and l (Sb and Sf), in sec, on a logarithmic horizontal axis from 10^-1 to 10^3, with one box each for Sf, Sb, Pa, and Pl in ‘NI’ and in ‘FMB’.]

(a) Boxplots of the distributions for each dataset. 

| Dataset | Event | Size | Frac | Q1 | Q2 | Q3 |
| ‘FMB’ | Pl | 112.7K | 53% | 13.9 | 67.5 | 282.4 |
| ‘FMB’ | Pa | 51.2K | 24% | 9.6 | 31.9 | 102.4 |
| ‘FMB’ | Sb | 29.4K | 14% | 17.7 | 35.4 | 72.7 |
| ‘FMB’ | Sf | 18.2K | 8.6% | 21.2 | 63.7 | 227.2 |
| ‘NI’ | Pl | 103.5K | 58% | 12.0 | 71.0 | 262.6 |
| ‘NI’ | Pa | 46.4K | 26% | 4.5 | 19.3 | 58.8 |
| ‘NI’ | Sb | 17.8K | 10% | 12.9 | 26.2 | 54.7 |
| ‘NI’ | Sf | 10.7K | 6.0% | 9.6 | 28.4 | 81.7 |

(b) Tabulated statistics for the distributions. 

Fig. 3: Distribution of the lengths for four events across both ‘NI’ 
and ‘FMB’. For Pl and Pa, this represents the time elapsed before 
the next event, and for Sb and Sf, this is the distance of the skip. 

the first video and then opened the second one. Second is if 
the duration $t_{i+1} - t_i$ is extremely long; in this case, it is likely 
that the user was engaging in some off-task behavior during 
this time. If $s_i$ = paused, the threshold on the duration is set 
to 20 min (as in [12] for web inactivity); if $s_i$ = playing, then 
the threshold is set to the length of the video. 
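The two discounting conditions can be expressed as a single predicate. The sketch below is illustrative only: the function name `keep_interval` and its parameters are hypothetical, and treating a gap exactly equal to the threshold as acceptable is an assumption, since the text does not say whether the comparison is strict.

```python
def keep_interval(video_a: str, video_b: str, state: str,
                  t_a: float, t_b: float, video_length_sec: float,
                  pause_limit_sec: float = 20 * 60) -> bool:
    """Decide whether a Pl/Pa event should be inserted between two clicks.

    video_a, video_b: video IDs of the two consecutive clicks
    state: player state after the first click ('playing' or 'paused')
    t_a, t_b: UNIX times (sec) of the two clicks
    """
    if video_a != video_b:
        # The user left one video and opened another: no continuity.
        return False
    gap = t_b - t_a
    # 20 min limit while paused; the video's own length while playing.
    limit = pause_limit_sec if state == "paused" else video_length_sec
    return gap <= limit  # assumption: equality is still accepted
```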

2) Event lengths: We now look to discretize the length $l_j$ 
and duration $d_j$ of the events. Fig. 3(a) gives the boxplots of 
the event distributions from each course; $d_j$ for Pl and Pa is 
shown, and we depict $l_j$ for Sb and Sf (we show only values 
that are at least 0.1 sec). Basic statistics of each distribution 
are also given in Fig. 3(b); specifically, the three quartiles $Q_1$, 
$Q_2$, and $Q_3$ are shown,³ as are the number of events for each 
distribution (Size) and the respective fractions (Frac). 

We make three observations in comparing the distributions. 
In each case, we employed a Wilcoxon Rank Sum (WRS) 
GD test for the null hypothesis that there was no difference 
between the distributions for each dataset overall, and report 
the p-values ($p$) from those tests.⁴ 

(i) ‘FMB’ has longer events: The distributions for each event 
are shifted to the right for ‘FMB’ relative to those for ‘NI’, 
meaning that ‘FMB’ tends to have longer events. In each of 
the four cases (Pl, Pa, Sb, and Sf), the p-values ($p$) were highly 
significant ($p \approx 0$). The fact that Pa is longer for ‘FMB’ is 
consistent with that content being more difficult. 

(ii) Sf is longer than Sb: The distribution of Sf is shifted to 
the right relative to Sb for both ‘FMB’ and ‘NI’ ($p <$ 1E-6). 
This indicates that when students skip forward, they tend to 
pass more material than they revise when skipping back. Sb 
also occurs more frequently than Sf for both courses. 

(iii) Pl is longer than Pa: The distributions for play and 
pause in both datasets indicate that users tend to stay in the 
playing state longer than paused ($p \approx 0$). This is stronger 

3 By definition, quartiles separate data in increments of 25%. 

4 We use the WRS test because Shapiro-Wilk tests detected significant 
departures from normality for each of the distributions. 


in the case of ‘NI’, which is again consistent with the fact that 
the ‘FMB’ material is more difficult. 

Event intervals. Clearly, $l_j$ and $d_j$ can vary substantially 
between events and datasets. To account for this relative 
variation, we will use the four intervals in-between the three 
quartiles for each event (given in Fig. 3(b)) to discretize the 
lengths. We specify three cases: 

(i) $\hat{e}_j \in$ {Sb, Sf}: When the event is a skip, we map it 
to $(\hat{e}_j\, q_j)$, where $q_j \in \{1, 2, 3, 4\}$ is chosen such that 
$l_j \in [Q_{q_j - 1}, Q_{q_j})$, with $Q_0 = 0$ and $Q_4 = \infty$. For example, 
suppose that event $E_j$ is such that $\hat{e}_j$ = Sb and $l_j = 20$ sec. 
In either course, this would be mapped to Sb2. 

(ii) $\hat{e}_j$ = Pa: In this case, the mapping works the same as the 
previous, except $q_j$ is chosen based on $d_j$ instead. 

(iii) $\hat{e}_j$ = Pl: Two long duration play events could still have 
different qualitative interpretations.⁵ To account for this, when 
$\hat{e}_j$ = Pl, we map it to $(\hat{e}_j q_{j,1}\ \hat{e}_j q_{j,2} \cdots \hat{e}_j q_{j,K})$, where 
$q_{j,k} \in \{1, 2, 3\}$ for $k = 1, \ldots, K$ is chosen according to: 

$$
q_{j,k} =
\begin{cases}
3, & d_j - \delta_{j,k} > Q_3 \\
\arg\min_{q_{j,k}} \left( d_j - \delta_{j,k} < Q_{q_{j,k}} \right), & \text{otherwise,}
\end{cases}
$$

with $\delta_{j,k} = \sum_{k' < k} Q_{q_{j,k'}}$ at each step. For example, suppose 
an event is Pl with $d_j = 550$ sec. For the quartiles in ‘NI’, 
this would be mapped to Pl3 Pl3 Pl2. 
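The three discretization cases can be checked with a short sketch. The function names `quantize_interval` and `quantize_play` are illustrative, and two details are assumptions not stated in the text: the play decomposition stops after the first step that falls in the "otherwise" branch, and a remainder exactly equal to $Q_3$ is labeled 3.

```python
from typing import List, Sequence


def quantize_interval(value: float, quartiles: Sequence[float]) -> int:
    """Cases (i) and (ii): return q in {1, 2, 3, 4} such that
    value lies in [Q_{q-1}, Q_q), with Q_0 = 0 and Q_4 = infinity."""
    for q, bound in enumerate(quartiles, start=1):
        if value < bound:
            return q
    return 4


def quantize_play(duration: float, quartiles: Sequence[float]) -> List[int]:
    """Case (iii): split a play duration into a list of labels in {1, 2, 3}.

    Requires Q_3 > 0, otherwise the loop below would not terminate.
    """
    q3 = quartiles[2]
    labels: List[int] = []
    remaining = duration  # d_j minus the running sum delta_{j,k}
    while remaining > q3:
        labels.append(3)
        remaining -= q3
    # Smallest q with remaining < Q_q; fall back to 3 when remaining == Q_3.
    final = next((q for q, b in enumerate(quartiles, 1) if remaining < b), 3)
    labels.append(final)
    return labels


# Quartiles (Q1, Q2, Q3) taken from Fig. 3(b).
FMB_SB = (17.7, 35.4, 72.7)
NI_SB = (12.9, 26.2, 54.7)
NI_PL = (12.0, 71.0, 262.6)

print("Sb%d" % quantize_interval(20, FMB_SB))                 # Sb2
print("Sb%d" % quantize_interval(20, NI_SB))                  # Sb2
print(" ".join("Pl%d" % q for q in quantize_play(550, NI_PL)))  # Pl3 Pl3 Pl2
```

Both worked examples from the text are reproduced: a 20 sec skip back maps to Sb2 in either course, and a 550 sec play in ‘NI’ maps to Pl3 Pl3 Pl2, since 550 exceeds 262.6 twice and leaves about 24.8 sec, which falls below $Q_2 = 71.0$ but not below $Q_1 = 12.0$.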

3) Event-type sequence specification: Let $\mathcal{S}$ = {Pl1, Pl2, 
Pl3, Pa1, ..., Pa4, Sb1, ..., Sb4, Sf1, ..., Sf4, Rf, Rs, Rd} be the 
set of $|\mathcal{S}| = 18$ events (with quantized lengths). For each 
UV Pair, we encode the clickstream log $E_1, \ldots, E_n$ as $S = 
(s_1, s_2, \ldots, s_{n'})$ where each $s_j \in \mathcal{S}$ is chosen according to the 
specifications in Sec. II-B2. As we will see in Sec. III, using 
this alphabet that incorporates event types and lengths allows 
us to obtain insights that cannot be gleaned with events alone. 

For comparison, we will refer to an event with length 1 as 
“short,” 2 as “medium,” 3 as “medium-long,” and 4 as “long.” 


III. Motifs of Video-Watching 
Using the event-type specification, we identify short, recurring 
sub-sequences within user behavior, i.e., behavioral 
motifs. As we will see in Sec. III-B, these motifs capture 
fundamental video watching characteristics of students such as 
reflecting on or revising material. We will also see that some 
of these motifs are significantly associated with performance. 


A. Motif Extraction 

We make use of the MEME Suite software package [14] for 
motif extraction. MEME has been applied in bioinformatics 
for motif identification in sequences of nucleotides and amino 
acids. We tune MEME to be applicable in our setting. 
Model and algorithm. The underlying algorithm is based on a 
probabilistic mixture model, where the key assumption is that 
each subsequence is generated by one of two components: 
a position-dependent motif model, or a position-independent 
background model. Under the motif model, each position j 
in a motif is described by a multinomial distribution, which 

5 The other events do not have this issue since they are not related to 
processing new, incoming information. 

specifies the probability of each character (i.e., each $s \in \mathcal{S}$ 
from Sec. II-B3) occurring at $j$. The background model is 
a multinomial distribution specifying the probability of each 
character occurring, independent of the positions; we employ 
the standard background of a 0-order Markov Chain. A latent 
variable is assumed that specifies the probability of a motif 
occurrence starting at each position in a given sequence 03- 

Motif extraction is formulated as a maximum likelihood 
estimation over this model, and an expectation-maximization 
(EM) based algorithm is used to maximize the expectation 
of the (joint) likelihood of the mixture model given both the 
data (i.e., the sequences) and the latent variables. We use the 
standard Dirichlet prior based on character frequencies for EM. 
Extraction. Each UV Pair’s clickstream sequence is encoded 
using the 24-character protein alphabet [14]. To do this, we 
choose the first 18 non-ambiguous characters $\mathcal{T}$, and then 
specify a 1:1 mapping $\mathcal{S} \to \mathcal{T}$. Whereas other work has 
focused on a single motif width (e.g., at 4 in [7]), we extract 
those of widths $w \in \{4, \ldots, 10\}$ from our datasets, with E-values 
(see below) at most 0.05; we will see that both long 
and short motifs can be insightful (see Fig. 6). 
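One way to realize the 1:1 mapping is sketched below. This is an assumption-laden illustration: it takes "the first 18 non-ambiguous characters" to mean the first 18 of the 20 standard amino-acid one-letter codes in alphabetical order, and it orders the events as they are listed in Sec. II-B3; the actual assignment used in the study is not given. The names `EVENTS`, `encode`, and `decode` are hypothetical.

```python
# The 18 quantized events, in the order they are listed in Sec. II-B3.
EVENTS = (
    ["Pl1", "Pl2", "Pl3"]
    + ["Pa%d" % q for q in range(1, 5)]
    + ["Sb%d" % q for q in range(1, 5)]
    + ["Sf%d" % q for q in range(1, 5)]
    + ["Rf", "Rs", "Rd"]
)

# The 20 standard (non-ambiguous) amino-acid letters, alphabetically.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
LETTERS = AMINO_ACIDS[: len(EVENTS)]  # assumption: take the first 18

TO_LETTER = dict(zip(EVENTS, LETTERS))
TO_EVENT = {letter: event for event, letter in TO_LETTER.items()}


def encode(sequence):
    """Map a list of event labels to a protein-alphabet string."""
    return "".join(TO_LETTER[s] for s in sequence)


def decode(text):
    """Map a protein-alphabet string back to event labels."""
    return [TO_EVENT[c] for c in text]
```

With this particular assignment, the motif Pl2 Pa4 Pl2 Pa4 would be written as the string CHCH, and any motif reported over the protein alphabet can be translated back with `decode`.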

For each motif, we obtain its E-value, and its position 
specific probability matrix (PSPM): 

E-value: The E-value judges overall significance. It is defined 
as the fraction of motifs (with the same width and occurrences) 
that would have higher log likelihood ratio if the sequences 
had been generated according to the background model. 
PSPM: This gives the fraction of times that each character 
appears in each position of the motif, taken over all sightings 
of the motif in the dataset. In the following, denote the PSPM 
for a motif by $P = [p_{i,j}]$, where $p_{i,j}$ is the fraction of times 
event $j$ occurs at position $i$. 

Representation. At each position $i$, we consider all events 
$j$ with $p_{i,j} > 0.25$.⁶ Formally, let $A_i$ be the sequence 
of indices into the event set $\mathcal{S}$ for $i$, arranged such that 
$p_{i,A_i(k)} \geq p_{i,A_i(k+1)}$ and $p_{i,A_i(k+1)} > 0.25$ $\forall k$. Then, there 
are three cases on the way $i$ is represented: if $|A_i| > 1$, $i$ is 
represented as $[s_{A_i(1)}\ s_{A_i(2)} \cdots]$; if $|A_i| = 1$, then the square 
brackets are omitted, with just $s_{A_i(1)}$ displayed; if $A_i = \emptyset$, then 
$i$ is displayed as ‘*’ to indicate that this position was taken by 
a variety of events, none of which occurred even 25% of the 
time. For example, the sequence [Pl2 Pl3] Pa1 * [Sf1 
Sf2 Sf4] is of length 4, with the first position being either 
Pl2 or Pl3 at least 50% of the time (Pl2 at least as often 
as Pl3), the second position being Pa1 at least 25% of the 
time, the third position being any event, and the last being 
either Sf1, Sf2, or Sf4 at least 75% of the time. 
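The display rule can be sketched as a small rendering function. The names `render_position` and `render_motif` are illustrative, and the PSPM is assumed to be given as one dictionary per motif position that maps event labels to their observed fractions; the probabilities in the usage example are made up for illustration.

```python
from typing import Dict, List


def render_position(probs: Dict[str, float], threshold: float = 0.25) -> str:
    """Render one motif position from its row of the PSPM."""
    # Keep events strictly above the threshold, most frequent first.
    kept = sorted(
        (event for event, p in probs.items() if p > threshold),
        key=lambda event: -probs[event],
    )
    if not kept:
        return "*"            # no event reaches the threshold
    if len(kept) == 1:
        return kept[0]        # single dominant event, no brackets
    return "[" + " ".join(kept) + "]"


def render_motif(pspm: List[Dict[str, float]], threshold: float = 0.25) -> str:
    """Render a whole motif, one token per position."""
    return " ".join(render_position(row, threshold) for row in pspm)


# Hypothetical PSPM with four positions.
example = [
    {"Pl2": 0.40, "Pl3": 0.35, "Pl1": 0.25},
    {"Pa1": 0.90, "Pa2": 0.10},
    {"Pl1": 0.2, "Pa1": 0.2, "Sb1": 0.2, "Sf1": 0.2, "Rf": 0.2},
    {"Sf1": 0.30, "Sf2": 0.28, "Sf4": 0.26, "Sf3": 0.16},
]
print(render_motif(example))  # [Pl2 Pl3] Pa1 * [Sf1 Sf2 Sf4]
```

Because the comparison is strict, an event observed exactly 25% of the time (Pl1 in the first position here) is not displayed.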

Motif support. For each motif, we obtain the fraction of 
sequences (FS) in which it occurs, i.e., its support across 
sequences, as well as the number of videos it appears in. We 
also obtain FS0 and FS1 as the fraction of non-CFA and CFA 
sequences in which the motif appears, respectively. 
Significance test: We determine whether there is a significant 
difference in the support of a motif across the CFA and non- 
CFA classes by running a two-sample test for proportions G3 


6 With 18 different events, a threshold of 25% is roughly 5 times the 
expected occurrence from a purely random selection of events. 



(a) ‘FMB’ (b) ‘NI’ 

Fig. 4: ECDFs of the number of sequences that each motif appears 
in, across both CFA and non-CFA. The supports are consistent across 
both groups. 



(a) ‘FMB’ (b) ‘NI’ 


Fig. 5: ECDFs of the number of videos that each motif appears in, 
across both CFA and non-CFA sequences. CFA sequences have a 
higher support for motifs across videos. 

for the null hypothesis $H_0$ that FS1 = FS0, with an alternative 
hypothesis $H_1$ that FS1 $\neq$ FS0. If the two-sided p-value ($p$) for 
this test is low enough (< 0.05), then the difference between 
the supports is significant, i.e., the motif is found in the class 
with higher support significantly more often. We also compute 
the estimated probability of success $\hat{p} \in [0,1]$ (i.e., of a CFA 
submission) for a sequence containing the motif, from the 
midpoint of the confidence interval returned by this test. 
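For readers who want to reproduce the flavor of this test, a pooled two-proportion z-test is sketched below using only the Python standard library. The exact variant used in the study (e.g., whether a continuity correction was applied) is not stated, so this is one common form and may not match the reported p-values exactly; it also does not reproduce the confidence-interval midpoint used for the estimated success probability. The function name and argument names are hypothetical.

```python
import math


def two_proportion_test(k1: int, n1: int, k0: int, n0: int):
    """Two-sided pooled z-test for H0: FS1 = FS0.

    k1, n1: CFA sequences containing the motif, and total CFA sequences
    k0, n0: the same counts for the non-CFA sequences
    Returns (FS1, FS0, z, p_value). Undefined (division by zero) if the
    pooled proportion is exactly 0 or 1.
    """
    fs1 = k1 / n1
    fs0 = k0 / n0
    pooled = (k1 + k0) / (n1 + n0)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n0))
    z = (fs1 - fs0) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return fs1, fs0, z, p_value
```

A motif would then be flagged as significantly associated with one class when the returned p-value is below 0.05, with the sign of `z` indicating whether the CFA (positive) or non-CFA (negative) class has the higher support.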

B. Results 

We obtained 87 and 123 motifs from ‘FMB’ and ‘NI’, 
respectively, which are the subject of the following analysis. 
Motif overview. We first analyze how the motif supports 
vary across sequences and videos. Overall, we find that the 
motifs are reasonably supported across sequences and videos 
on average, for both CFA and non-CFA in each course. 
Sequences: In Fig. 4, we plot the Empirical CDF (ECDF) 
of the fraction of sequences that each motif appears in, for 
both CFA and non-CFA. The supports are similar across these 
groups: for ‘FMB’, each motif appears in 5.9% of the non-CFA 
sequences on average, and 6.5% of the CFA; for ‘NI’, this 
is 5.8% of the CFA, and 4.2% of the non-CFA. In both courses, 
the motifs with largest support (first row in Fig. 6(a) and (b)) 
appear in > 25% of the sequences. 

Videos: Fig. 5 gives the ECDF of the number of videos that 
each motif occurred in at least once and at least 10 times. 
Overall, CFA has higher support than non-CFA over videos. 
We also see that the supports decrease for higher thresholds, 
e.g., for ‘FMB’ in (a), while the top 20% of the motifs appear 
in at least 67 videos for CFA, this drops to only 18 videos 
considering at least 10 occurrences. 

1) Individual motifs: We inspect patterns in the most significant 
of the 210 extracted motifs. This list is obtained by 

| Group | Motif | E-value | FS(%) | FS0(%) | FS1(%) | $\hat{p}$(%) | p-value |
| Pa I | [Pl2 Pl3] [Pa4 Pa3] [Pl2 Pl1] [Pa2 Pa3] Pl2 Pa3 Pl2 [Pa2 Pa3] Pl3 | 5.3E-64 | 28.5 | 26.2 | 29.5 | 53.3 | 4.6E-4** |
| Pa II | Pl2 Pa4 Pl2 Pa4 | 1.5E-06 | 13.2 | 13.2 | 13.3 | 50.1 | 0.901 |
| Pa III | [Pa1 Pa3] Pl1 [Pa2 Pa1] Pl1 [Pa1 Pa2] [Pl1 Pl2] [Pa1 Pa2] [Pl1 Pl2] | ≈ 0 | 12.1 | 11.3 | 12.5 | 51.2 | 0.0745 |
| Pa IV | [Pl1 Pl3 Pl2] Pa4 Pl2 Pa3 [Pl2 Pl1] * Pl2 * [Pl3 Pl1] Pa3 | 1.5E-15 | 10.9 | 9.3 | 11.6 | 52.3 | 3.9E-4** |
| Sb I | Sb3 [Pl2 Pl1] [Sb2 Sb3] Pl2 Sb2 Pl2 [Sb2 Sb3] [Pl3 Pl2] | 6.0E-245 | 10.2 | 8.84 | 10.8 | 51.9 | 2.0E-3** |
| Sb II | Pl3 Sb3 Pl2 [Sb3 Sb2] Pl3 | 6.7E-40 | 8.92 | 8.11 | 9.28 | 51.2 | 0.048* |
| Sb III | Pl2 Sb2 [Pl1 Pl2] [Sb2 Sb3] Pl3 | 0.044 | 7.79 | 6.57 | 8.32 | 51.7 | 1.6E-3** |
| Sf I | Pl2 Sf3 [Pl1 Pl2] Sf2 [Pl1 Pl2] Sf1 [Pl2 Pl1] [Sf2 Sf1] | ≈ 0 | 9.46 | 10.03 | 9.22 | 49.2 | 0.186 |
| Sf II | [Pl2 Pl1] [Sf2 Sf3] Pl1 [Sf3 Sf2] | ≈ 0 | 6.42 | 7.41 | 5.98 | 48.6 | 4.95E-3** |
| Rf I | Pl3 [Rf Rd] [Pl2 Pl1] Rf [Pl3 Pl2] Rf | ≈ 0 | 4.55 | 3.89 | 4.84 | 50.9 | 0.0295* |
| Rf II | Rf Rd [Pl1 Pl2] Rf Pl3 | 1.2E-70 | 1.77 | 1.22 | 2.00 | 50.8 | 4.7E-3** |

(a) Motifs for ‘FMB’. 

| Group | Motif | E-value | FS(%) | FS0(%) | FS1(%) | $\hat{p}$(%) | p-value |
| Pa I | [Pl2 Pl3] Pa4 [Pl2 Pl3] Pa4 Pl3 | 2E-81 | 26.8 | 27.6 | 26.5 | 48.8 | 0.338 |
| Pa II | Pl2 Pa4 Pl2 Pa4 | 1.8E-44 | 14.3 | 15.9 | 13.7 | 47.9 | 0.0233* |
| Pa III | Pl2 Pa4 Pl2 Pa3 Pl3 | 3.2E-19 | 11.8 | 11.0 | 12.1 | 51.0 | 0.241 |
| Pa IV | Pl1 Pa1 Pl1 Pa1 Pl1 Pa1 [Pl1 Pl3] | ≈ 0 | 11.7 | 12.7 | 11.4 | 48.7 | 0.145 |
| Sb I | [Sb3 Sb4] [Pl2 Pl3] [Sb3 Sb2] Pl2 [Sb3 Sb2] [Pl3 Pl2] | 9.1E-191 | 9.2 | 8.6 | 9.4 | 50.8 | 0.291 |
| Sb II | Sb3 Pl2 Sb2 [Pl2 Pl1] Sb2 Pl2 [Sb3 Sb2] [Pl3 Pl2] | 2.2E-125 | 5.3 | 4.2 | 5.7 | 51.5 | 0.014* |
| Sf I | [Pl3 Pl1] [Sf3 Sf4] [Pl1 Pl2] [Sf4 Sf3 Sf2] [Pl1 Pl2] [Sf3 Sf4] | 6.6E-100 | 7.8 | 8.9 | 7.4 | 48.4 | 0.0279* |
| Sf II | Pl2 [Sf3 Sf2] Pl1 [Sf3 Sf2] | 1.1E-248 | 7.7 | 9.7 | 7.0 | 47.4 | 2.2E-4** |
| Sf III | Pl2 Sf3 [Pl1 Pl2] [Sf3 Sf2] [Pl3 Sb1] | 2.7E-3 | 6.3 | 7.2 | 6.0 | 48.8 | 0.0598 |
| Rf I | Rf [Pl2 Pl1] Rf [Pl3 Pl2] | ≈ 0 | 2.3 | 3.5 | 1.9 | 48.4 | 9E-5** |
| Rf II | [Rf Rs] [Pl2 Pl1] Rd [Pl2 Pl3] Rf | 7.3E-16 | 2.5 | 3.1 | 2.3 | 49.2 | 0.064 |

(b) Motifs for ‘NI’. 

Fig. 6: Representative sample of motifs identified for each course. Each motif is grouped by the dominant event it contains outside of Pl. 
FS is the fraction of sequences over both CFA and non-CFA, while FS0 and FS1 are for the separate cases, $\hat{p}$ is the estimated probability of 
success (CFA) if a sequence contains the motif, and the p-value ($p$) is the significance of $\hat{p}$ (* indicates $p < 0.05$, and ** is for $p < 0.01$). 


applying the following procedure. First, noticing that all motifs 
contain PI events, we group them into categories based on 
the most recurring alternate event, leading to 4 groups. Then, 
within each category, we consider each motif that either (i) 
has one of the top-10 highest supports or (ii) has a significant 
p (< 0.05) comparing CFA and non-CFA supports. Finally, if 
one motif is a subsequence of another, then we remove the 
one that has lower support or is less significant. 

This yields 19 and 21 motifs for ‘FMB’ and ‘NI’, respectively.
In Fig. 6, we give the representative sample of these 40
that are mentioned in the following discussion. Note that we
have grouped each motif by the most frequent type of event
that it contains aside from play. Also, each motif is assigned
an ID consisting of its group and number (e.g., Pa II in ‘FMB’
is motif Pl2 Pa4 Pl2 Pa4).
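To make the support columns of Fig. 6 concrete, the following is a minimal sketch of how the fraction of sequences containing a motif could be computed. It is an illustration rather than the MEME-based pipeline used in the study, and it assumes that a bracketed group lists the alternative symbols allowed at one motif position and that `*` matches any symbol. The helper names `parse_motif`, `contains_motif`, and `support` are hypothetical.

```python
import re


def parse_motif(text):
    """Turn 'Pl2 [Pa4 Pa3] * Pl3' into a list of slots.

    Each slot is a set of allowed symbols, or None for the '*' wildcard.
    The bracket and wildcard semantics are assumptions of this sketch.
    """
    slots = []
    for group, single in re.findall(r"\[([^\]]+)\]|(\S+)", text):
        if group:
            slots.append(set(group.split()))
        elif single == "*":
            slots.append(None)
        else:
            slots.append({single})
    return slots


def contains_motif(sequence, slots):
    """True if the motif occurs as a contiguous (ungapped) run in sequence."""
    width = len(slots)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if all(slot is None or sym in slot for sym, slot in zip(window, slots)):
            return True
    return False


def support(sequences, slots):
    """Fraction of sequences that contain the motif at least once."""
    if not sequences:
        return 0.0
    return sum(contains_motif(s, slots) for s in sequences) / len(sequences)
```

Calling `support` separately on the non-CFA and CFA sequences would give quantities analogous to FS0 and FS1, and on their union, FS.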

Overview. The motifs exhibit many similar structural attributes,
which occur in spite of the fact that the encoding
quantiles are different for each event and course (see Fig. 3).
Also, since MEME finds ungapped motifs (i.e., those existing
as exact matches in the data, without a separate layer of
similarity matching), these identified behaviors exist exactly in
the sequences, contrary to other work [7] which has resorted
to approximate string searching. Interestingly, we find that the
motif with highest support in each group also tends to have
the longest length (average of 7.5 over all groups with at
least two motifs). Also, we find that the motifs in the Pa
group have the largest supports (FS) overall (> 10% mostly),
which is consistent with the fact that there are fewer skip and
ratechange events in the datasets (see Fig. 3(b)).

We present our most interesting observations for each group: 
Reflecting (Pa). The occurrence of play together with pause
indicates that lectures are generally thought-provoking, causing
students to reflect on the material they just saw. In both
courses, the events forming the motifs in this group cover the
entire range from short to medium-long plays (Pl1 - Pl3)
interspersed with short to long pauses (Pa1 - Pa4).

The motif with the highest support in ‘FMB’ - Pa I - 
can be viewed as a long sequence of medium to medium- 
long plays with medium-long to long pauses in-between, also 
characteristic of Pa IV in ‘FMB’. This behavior occurs more 
often in the CFA group in both cases (p < 0.01). Motifs Pa III 
in ‘FMB’ and Pa IV in ‘NI’ are long sequences too, but consist 
of short to medium plays followed by short to medium pauses 
and do not distinguish between CFA and non-CFA (p > 0.07). 
Motifs Pa II in ‘FMB’ and Pa I in ‘NI’ are shorter sequences, 
with medium to medium-long plays followed by long pauses, 
and also do not differentiate between the groups (p > 0.33). 

In ‘NI’, the Pa group exhibits less significance in p-values. 
For this reason, we do not draw conclusions on differences 
between CFA and non-CFA from these sequences. 

Revising (Sb). From the six motifs in the Sb group, we
identify two interesting, recurring subsequences: Pl2 Sb3
Pl2 Sb3 (Sb I and II for ‘FMB’, and Sb I for ‘NI’), and
Pl2 Sb2 Pl2 Sb2 (Sb III for ‘FMB’ and Sb II for ‘NI’).
Roughly speaking, each of these is associated with playing
for a length of video, and then revising some or all of that
content. To see this, consider the ranges of Pl and Sb from
Fig. 3 associated with these subsequences: Pl2 covers 14 to
68 sec for ‘FMB’, and Sb2 to Sb3 covers 18 to 73 sec; for
‘NI’, these ranges are 12 to 71 sec and 13 to 55 sec. The
play and skip ranges are closely overlapping in each case.
Taking the extreme ends of each range, they are associated
with skipping back anywhere from 1 min before the starting
play point to 50 sec after it,^7 which are local considering the
video lengths. This characteristic of local revising is further
seen in that Sb4, a long skip back, does not appear in these
motifs. Note that 4 of the 5 motifs containing these behaviors
are significantly associated with CFA (p < 0.05).

7 We assume a default playback rate as an approximation.
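As a rough check of the quoted displacement, one reading (an assumption about how the range endpoints were combined, at the default playback rate) is to subtract the skip-back length from the play length that precedes it. Writing $\Delta$ for the resulting offset from the starting play point, the ‘FMB’ ranges give

$$\Delta_{\min} = 14 - 73 = -59 \text{ sec}, \qquad \Delta_{\max} = 68 - 18 = +50 \text{ sec},$$

which agrees with the span of about 1 min before to 50 sec after. Under the same reading, the ‘NI’ ranges give $12 - 55 = -43$ sec and $71 - 13 = +58$ sec, so the offsets stay within roughly a minute of the starting point in both courses.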

We also considered the number of skip backs originating 
at each video position across all UV Pairs. We find that the 
largest origination point of these events is at the end of videos. 
In particular, out of all Sb events, those originating within 10 
sec of the videos’ end are 16% and 13% of the total for non- 
CFA and CFA in ‘FMB’. As a reference, if we take the highest 
location of Sb for each video outside of the last 10 sec, these 
constitute 4.5% and 3.6% of the total for non-CFA and CFA. 
This, combined with the motifs suggesting revision when Sb 
occurs, implies that those students who are revising multiple 
times before answering a quiz have a higher chance of success. 
Skimming (Sf). In both of the courses, the motifs in the Sf
group are primarily medium to long skips forward with short
to medium plays in-between. Further, the skips are longer than
the plays occurring before and after; comparing the lengths of
Pl and Sf events in Fig. 3, we see that for both courses, range
$Q_j$ to $Q_{j+1}$ for Sf is always larger than $Q_{j-1}$ to $Q_j$ for Pl.
This recurring behavior in the Sf group can then be interpreted
as skimming through the material quickly with less exposure to
the material. We find that 3 of these 5 motifs are significant in
favor of non-CFA (p < 0.03). We contrast this to a finding in
[5], where the total number of skip forwards in a sequence was
not found to be associated with either group. This underscores
the utility of considering the clickstream sequences, rather than
computing aggregate quantities to summarize them.

While Sb and Sf occurring together in a motif (e.g., Sf III 
in ‘NI’) can possibly be interpreted as skipping forward with 
caution, we find that this is also close to being significant in 
favor of non-CFA (p < 0.06). 

Speeding (Rf). Referring to Rf in ‘FMB’, motifs I and II
indicate that viewing the material at a faster than default rate,
i.e., speeding, is more often associated with the CFA class
than not (p < 0.03). With these motifs, learners also return
to the default rate (Rd), indicating they are slowing down
for important content. To this point, in ‘FMB’, we see no
significant motifs for slower than default rates; however, one
does exist in ‘NI’ (Rf II). Also, Rf I in ‘NI’ is significantly
associated with non-CFA (p = 9E-5), which could indicate
that a faster rate is harmful in this course.

2) Key messages: Overall, we draw a few conclusions. 
Motif groups. There are four main groups: 

Reflecting : Pausing to reflect on material repeatedly is the most 
commonly recurring behavior. If the time spent reflecting is 
not too long, but longer than the time spent watching, then a 
positive outcome is most likely (in ‘FMB’). 

Revising : Repeated revision of the material suggests students 
will gain a better understanding of the content. 

Skimming: Skimming through material quickly, even with 
caution, is costly in terms of knowledge gained. 

Speeding: Students who watch the videos at a faster than 
default rate may already be familiar with the material, leading 
to a correct answer (in ‘FMB’). They also may slow to the 
default if they sense unfamiliar material. 

Significance of associations. For each motif, the identified
association with CFA or non-CFA is particularly important,
because in many cases either would be intuitive. For example,
a revising motif could presumably come from a student 
reinforcing material prior to the quiz (in line with CFA) or 
from excess confusion caused by the material (in line with 
non-CFA), but the results indicate the former is more likely. 
As another example, skimming could come from a student 
perceiving familiarity with the content in a video, which could 
intuitively be either a correct (in line with CFA) or an incorrect 
(in line with non-CFA) perception, but results favor the latter. 
Importance of lengths. We emphasize the importance of
having included the lengths, in addition to the events, in our
framework from Sec. II-B3 in order to make these conclusions.
For instance, the sequence Pl Sb Pl Sb identified in [7]
cannot be associated with revising, because it is not clear
how far back the student has skipped relative to having
played in-between. In the same way, Pl Sf Pl Sf cannot
be concluded as skimming, because the lengths of play and
skip are not indicated in the model. Also, even small changes
in the motif lengths can affect significance (e.g., in ‘FMB’,
while Pa I is associated with CFA, Pa II is not).

We have seen that clickstream motifs are useful in studying
learning behavior, and that they can be significantly related
to performance. In
terms of using them to model behavior for CFA prediction, 
however, there are two drawbacks. First, while the supports 
are reasonable considering these are rather long subsequences, 
none of the motifs appear in a majority of the sequences 
(max 28.5%). Second, none of the p success estimates deviate 
substantially from 50% (max 3.3%). Hence, we will now turn 
to an alternate clickstream sequence representation which is 
more applicable to CFA prediction. Nonetheless, some of the 
conclusions here will guide our modeling choices. 

IV. Model of Position Sequence 

In this section, we will formalize a position-based sequence
representation, which factors in the location in the videos that
a student visited. Then, we will present CFA models based on
this framework, which will be evaluated in Sec. V.


A. Modeling Framework 

1) Definitions: Let $v \in \mathcal{V}$ denote video $v$ in the set of
videos $\mathcal{V}$ for a course, indexed chronologically (i.e., by release
date of the videos).^8 Also, let $c \in \mathcal{C}$ denote class $c$ in the set of
binary classes $\mathcal{C} = \{0, 1\}$, where $c = 0$ indicates a non-CFA
submission and $c = 1$ is CFA. With $u \in \mathcal{U}$ as user $u$ in the set
of all users $\mathcal{U}$, we let $\mathcal{U}^v \subseteq \mathcal{U}$ be the set of users who have a
UV Pair for $v$, and $\mathcal{U}^{v,c} \subseteq \mathcal{U}^v$ be those who fall into class $c$
with respect to their answer submission. For evaluation in Sec.
V, we will generate training ($\mathcal{U}^v_T$) and test ($\mathcal{U}^v_E$) sets as subsets
of $\mathcal{U}^v$; $\mathcal{U}^v_T$ and $\mathcal{U}^v_E$ are always chosen such that
$\mathcal{U}^v_T \cap \mathcal{U}^v_E = \emptyset$.

2) Position-based sequence specification: We will divide
each video into a number of intervals. Let $h_v$ be the length
(in sec) of $v$. We define $w_v$ to be the width that partitions $v$
into $N(w_v) = \lceil h_v / w_v \rceil$ uniform intervals, such that interval
$i \in \mathcal{P}_v(w_v) = \{1, ..., N(w_v)\}$ spans the range
$[(i - 1) \cdot w_v, \; i \cdot w_v]$. For each UV Pair, we can then model the
behavior as a sequence of positions $p^{u,v} = (p_1, p_2, ..., p_n, ...)$,
where $p_n \in \mathcal{P}_v(w_v)$ is the index of the $n$th position visited.^9

8 Recall from Sec. II-A that we define a “video” to be all videos for a quiz.

To generate these sequences, we first apply the same denoising
procedure described in Sec. II-B1 to each event $E_i$. Then,
for each UV Pair, starting with $p = ()$ we do the following:

1) For $E_1$, add $\lfloor p_1 / w_v \rfloor$ to $p$.

2) Consider each sequential pair of events $E_i$, $E_{i+1}$, $i \geq 1$.
If the state $s_i$ = paused, then only $\lfloor p_{i+1} / w_v \rfloor$ is added
to $p$. But if $s_i$ = playing, then:

• If the event $e_{i+1} \neq$ skip, then $(\lfloor p_i / w_v \rfloor + 1, ...,
\lfloor p_{i+1} / w_v \rfloor - 1, \lfloor p_{i+1} / w_v \rfloor)$ is appended to $p$.

• If $e_{i+1}$ = skip, then $(\lfloor p_i / w_v \rfloor + 1, ..., \lfloor p'_{i+1} / w_v \rfloor - 1,
\lfloor p'_{i+1} / w_v \rfloor, \lfloor p_{i+1} / w_v \rfloor)$ is appended instead.^10

For example, suppose $h_v = 300$, $w_v = 15$, and a user generates
$E_1$ = (play, 0, 0, playing, 1.0), $E_2$ = (skip, 200, 50,
playing, 1.0), $E_3$ = (ratechange, 230, 80, playing, 1.25),
and $E_4$ = (pause, 300, 127, paused, 1.25) on the video.
Then, $p$ = (0, 1, 2, 3, 13, 14, 15, 15, 16, ..., 20).

3) Model factors: There are (at least) three types of information
for each $p^{u,v}$ that could have an effect on performance:

(1) Positions. First is the number of times a given position
$i \in \mathcal{P}_v(w_v)$ was visited. One would expect these to differ
between CFA and non-CFA, because certain parts of videos will
be more important to questions. To see this, we can refer to
two motif groups which were associated with CFA: reflecting,
which indicates that these sequences may have more visits
to important positions through pausing, and revising, which
suggests that these sequences may have more visits to positions
associated with the questions through repeated revision before
answering. Further, the skimming motif suggests that non-CFA
sequences will have fewer visits to important positions.

(2) Transitions. Second is the number of transitions between
the positions, i.e., the number of times a given tuple $(i, j)$ is
a subsequence of $p^{u,v}$. Considering each tuple $(p_n, p_{n+1})$:

• If $p_{n+1} < p_n$, then the user had skipped back. We call
this a backward transition.

• If $p_{n+1} > p_n + 1$, then the user had skipped over the
material in $(p_n, p_{n+1})$. This is a forward transition.

• If $p_{n+1} = p_n + 1$, then the user moved directly to the
next position. This is a direct transition.

• If $p_{n+1} = p_n$, then the user had some event within the
current position. This is a repeat transition.


We say that direct and repeat transitions are local, whereas 
backward and forward are non-local. As with positions, the 
transition factors can capture the motif behavior associated 
with CFA and non-CFA, except in terms of sequences of visits, 
e.g., backward transitions capture the Sb in a revising motif, 
and forward transitions capture the Sf in a skimming motif. 
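The four transition types can be sketched in a few lines of code. This is an illustrative helper (the names `classify_transition` and `transition_counts` are hypothetical), applied to a position sequence given as a list of integer indices.

```python
def classify_transition(current, nxt):
    """Label the step from position `current` to position `nxt`."""
    if nxt < current:
        return "backward"   # non-local: the user skipped back
    if nxt == current:
        return "repeat"     # local: an event within the same position
    if nxt == current + 1:
        return "direct"     # local: moved on to the next position
    return "forward"        # non-local: skipped over intermediate positions


def transition_counts(positions):
    """Count each transition type over consecutive pairs of a sequence."""
    counts = {"backward": 0, "repeat": 0, "direct": 0, "forward": 0}
    for current, nxt in zip(positions, positions[1:]):
        counts[classify_transition(current, nxt)] += 1
    return counts
```

For the example sequence (0, 1, 2, 3, 13, 14, 15, 15, 16, ..., 20) given earlier, this yields ten direct transitions, one forward transition (3 to 13), one repeat transition (15 to 15), and no backward transitions.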

(3) Time spent. The amount of time spent at the different 
positions. One would expect these times to be indicative of 
CFA/non-CFA in a similar manner to visit frequencies. 

In order to evaluate the benefit of including each of these
factors, we will consider three prediction models: Discrete
time Positions (DP), which incorporates the number of visits
to each position; Discrete time Transitions (DT), which models
transitions between positions; and Continuous time Transitions
(CT), which factors in inter-arrival times between positions.
Each model will be tested on each video separately, allowing
us to compare results on a per-video basis in Sec. V.

9 For brevity, we will typically refer to $p^{u,v}$ as just $p$, with the
understanding that it refers to the UV Pair in question.

10 Recall from Sec. II-B1 that when $E_i$ is a skip event, $p'_i$ is the position
of the video player immediately before the skip.


B. Position-Based Modeling 

Discrete Time Positions (DP). For the DP model, video positions
are treated as independent events. Let $F^{v,c} = [f_i^{v,c}] \in
[0, 1]^{N(w_v)}$ be the probability distribution of visit frequency
across positions $i \in \mathcal{P}_v(w_v)$. This is estimated over the UV
Pairs in the training set of class $c$ as

$$f_i^{v,c} = O_i^{v,c} \Big/ \sum_j O_j^{v,c}, \quad (1)$$

where $O_i^{v,c}$ is the number of occurrences of position $i$ over sequences
in $\mathcal{U}^{v,c}_T$, i.e., $O_i^{v,c} = \sum_{u \in \mathcal{U}^{v,c}_T} \sum_n \mathbb{1}\{p_n = i\}$.

We test the ability of this model to identify which class
each $u \in \mathcal{U}^v_E$ belongs to. For this purpose, we compute the
likelihood of observing $p$ on video $v$ to be in $c$, given $F^{v,c}$, as

$$L(p \mid F^{v,c}) = \prod_n f_{p_n}^{v,c}. \quad (2)$$

Then, the prediction $\hat{c} \in \{0, 1\}$ of the class for $p$ is determined
by application of the Maximum a Posteriori (MAP) decision
rule. But recall that there is a bias towards $c = 1$ for each
course (see Fig. 1). As a result, we introduce a term $b_v \geq 0$
into MAP, which will be tuned through the cross validation
procedure described in Sec. V-A:

$$\hat{c} = \begin{cases}
1 & g^{v,1} L(p \mid F^{v,1}) > g^{v,0} L(p \mid F^{v,0}) + b_v \\
0 & g^{v,1} L(p \mid F^{v,1}) < g^{v,0} L(p \mid F^{v,0}) + b_v \\
\mathbb{1}\{U > g^{v,0}\} & \text{otherwise}
\end{cases} \quad (3)$$

where $g^{v,c} = |\mathcal{U}^{v,c}_T| / |\mathcal{U}^v_T|$ is the estimated class bias for video
$v$, and $U$ denotes a random number drawn from [0, 1].
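The DP estimate, likelihood, and biased MAP rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the evaluation: positions are 0-indexed integers as in the worked example, the class biases are taken to satisfy $g^{v,0} = 1 - g^{v,1}$, no smoothing is applied to unseen positions, and the function names are hypothetical.

```python
import math
import random
from collections import Counter


def estimate_dp(sequences, num_positions):
    """Eq. (1): relative visit frequency of each position over training sequences."""
    counts = Counter(pos for seq in sequences for pos in seq)
    total = sum(counts.values())  # assumes at least one visit was observed
    return [counts[i] / total for i in range(num_positions)]


def dp_likelihood(sequence, freq):
    """Eq. (2): product of the visit frequencies along the sequence."""
    return math.prod(freq[pos] for pos in sequence)


def map_predict(sequence, freq0, freq1, g1, b_v=0.0, rng=random):
    """Eq. (3): MAP rule with additive bias b_v on the non-CFA side.

    Ties are broken at random according to the class bias.
    """
    score1 = g1 * dp_likelihood(sequence, freq1)
    score0 = (1.0 - g1) * dp_likelihood(sequence, freq0) + b_v
    if score1 > score0:
        return 1
    if score1 < score0:
        return 0
    return 1 if rng.random() > (1.0 - g1) else 0
```

Because Eq. (2) is a product of many probabilities, raw likelihoods become very small for long sequences, which is in keeping with the small tuned values of $b_v$ reported in the evaluation.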


C. Transition-Based Modeling 

In modeling transitions between positions, we will only
consider one-step transitions. This is common in webpage
clickstream analysis (e.g., [12]), and will be useful here since
the state spaces we consider can be large, depending on $w_v$.^11
1) Aggregating non-local transitions: The cohort estimator 
for a Markov Chain model uses the fraction of transitions from 
state i to j in estimating the probability of transitioning from 
i to j. We found this model not appropriate here, because
the number of transitions between two non-local positions is 
rather sparse, implying that there is not enough data to estimate 
these specific transitions. 

11 This may not be ideal because unlike sequences of webpages, learning
builds on itself. It is harder to estimate higher order transitions due to position-
specific data sparsity. We still see substantial benefit with a one-step model.

Fig. 7: Plot of the fraction of local (repeat and direct) and non-local (backward and forward) transitions for each window size $w_v$, averaged
over all UV Pairs for each position and video $v$, for each dataset. Clearly, the fraction of non-local transitions is very low in each case,
reaching a maximum of 2.4% for forward transitions in ‘FMB’ (at $w_v = 150$).

To see this, we inspect the sequences $p^{v,c}$ for varying $w_v$.
In particular, for each position in video $v$, we first find the
total number of times each type of transition from Sec. IV-A3
occurs, aggregated across the UV Pairs. Then, we sum these
totals over all positions, and find the fraction of each type of
transition. We repeat this for each $w_v \in \{5, 10, ..., 600\}$ (i.e.,
through 10 min), and then average across the videos $v$ for each
$w_v$. Fig. 7 shows the result for each course, from which we
make two observations for local and non-local transitions:

(i) Tradeoff between local transitions: As $w_v$ increases, the
percentage of repeat transitions increases monotonically (1.7%
to 59% for ‘FMB’, 2.8% to 58% for ‘NI’), while the percentage
of direct transitions decreases monotonically (98% to 40% for
‘FMB’, 97% to 41% for ‘NI’). This is to be expected, since each
position is increasing in size with $w_v$.

(ii) Infrequency of non-local transitions: The vast majority of
transitions are local. For example, from Fig. 7, the largest
fraction of backward transitions is 2.3% in ‘FMB’, at $w_v = 120$.

As a result of the second observation, the models that
follow will aggregate all observed forward transitions to form
a single, uniform probability at each position, and likewise
for backward transitions. To this end, we define $\mathcal{T}_{i,k}$ =
$\{1, ..., i - 1\}$ for $k = 1$; $\{i\}$ for $k = 2$; $\{i + 1\}$ for $k = 3$;
and $\{i + 2, ...\}$ for $k = 4$ to be the set of states constituting a
backward ($k = 1$), repeat ($k = 2$), direct ($k = 3$), and forward
($k = 4$) transition at position $i$.

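Discrete Time Transitions (DT). In this model, we discretize
time, discounting the interarrival times. Let $F^{v,c} = [f_{i,k}^{v,c}] \in
[0, 1]^{N(w_v) \times 4}$ be the matrix of transition probabilities, where
$f_{i,k}^{v,c}$ is the probability that the next position will be in $\mathcal{T}_{i,k}$
given the current is $i$. We also assume that the transitions are
homogeneous, i.e., independent of time $n$.

Considering the sequences of positions $p$ across users $u \in
\mathcal{U}^{v,c}_T$, we obtain the number of transitions from $i$ to $k$ as

$$O_{i,k}^{v,c} = \sum_{u \in \mathcal{U}^{v,c}_T} \sum_n \mathbb{1}\{p_n = i, \; p_{n+1} \in \mathcal{T}_{i,k}\}. \quad (4)$$

From (4), we estimate $f_{i,k}^{v,c} = O_{i,k}^{v,c} / \sum_{k'} O_{i,k'}^{v,c}$, and the
likelihood of $p$ from user $u \in \mathcal{U}^v_E$ on video $v$ is

$$L(p \mid F^{v,c}) = f_{p_1}^{v,c} \cdot \prod_n f_{p_n, p_{n+1}}^{v,c}, \quad (5)$$

where $f_{p_n, p_{n+1}}^{v,c}$ is the entry $f_{p_n, k}^{v,c}$ for the $k$ with $p_{n+1} \in \mathcal{T}_{p_n, k}$,
and $f_{p_1}^{v,c}$ is the distribution at the initial position $p_1$ of $p$,
obtained from (1). The MAP for DT is the same as in (3),
except with (5) in place of (2).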
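A minimal sketch of the DT estimate and likelihood is given below. It assumes integer position indices, encodes the four transition types with the same $k = 1, ..., 4$ convention as the text, and takes the initial-position probabilities as a dictionary supplied by the caller (e.g., from the DP frequencies). The function names are hypothetical and no smoothing of unseen transitions is attempted.

```python
from collections import defaultdict

BACKWARD, REPEAT, DIRECT, FORWARD = 1, 2, 3, 4


def transition_type(current, nxt):
    """Map a step between positions to its aggregated type k."""
    if nxt < current:
        return BACKWARD
    if nxt == current:
        return REPEAT
    if nxt == current + 1:
        return DIRECT
    return FORWARD


def estimate_dt(sequences):
    """Eq. (4) followed by row normalisation: probs[i][k] estimates f_{i,k}."""
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][transition_type(current, nxt)] += 1
    probs = {}
    for i, row in counts.items():
        total = sum(row.values())
        probs[i] = {k: row.get(k, 0) / total
                    for k in (BACKWARD, REPEAT, DIRECT, FORWARD)}
    return probs


def dt_likelihood(sequence, probs, initial):
    """Eq. (5): initial-position probability times transition-type probabilities."""
    likelihood = initial.get(sequence[0], 0.0)
    for current, nxt in zip(sequence, sequence[1:]):
        k = transition_type(current, nxt)
        likelihood *= probs.get(current, {}).get(k, 0.0)
    return likelihood
```

Aggregating by type keeps only four parameters per position, which is what makes the estimate feasible given how rarely any specific pair of non-local positions is observed.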

Continuous Time Transitions (CT). This model incorporates
the interarrival times between transitions. Rather than computing
the time-varying transition probabilities, we instead
work with the transition rates. To this end, we define
$Q^{v,c} = [q_{i,k}^{v,c}] \in \mathbb{R}^{N(w_v) \times 4}$ as the transition rate matrix for
the model, where $q_{i,k}^{v,c}$, $k \neq 2$, represents the rate of departure
from position $i$ and arrival at a position in $\mathcal{T}_{i,k}$.

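Let $r^{v,c} = [r_i^{v,c}] \in \mathbb{R}^{N(w_v)}$ be the vector of the total time
spent by $\mathcal{U}^{v,c}_T$ in state $i$. These terms are estimated as

$$r_i^{v,c} = \sum_{u \in \mathcal{U}^{v,c}_T} \sum_n \mathbb{1}\{p_n = i\} \cdot d_n, \quad (6)$$

where $d_n$ is the duration of event $n$ in $p$ (see Sec. II-B1).
In estimating the $q_{i,k}^{v,c}$, we must also obtain the number of
transitions from $i$ to $k$ over users $u \in \mathcal{U}^{v,c}_T$, i.e., the $O_{i,k}^{v,c}$ from
(4); with this, the $q_{i,k}^{v,c}$ terms are estimated as

$$q_{i,k}^{v,c} = \begin{cases}
O_{i,k}^{v,c} / r_i^{v,c} & k \neq 2 \\
-\sum_{k' \neq 2} q_{i,k'}^{v,c} & k = 2
\end{cases} \quad (7)$$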

Finally, the likelihood of sequence $p$ for $u \in \mathcal{U}^v_E$ is computed:

$$L(p \mid Q^{v,c}) = \prod_{i,k:\, k \neq 2} \left(q_{i,k}^{v,c}\right)^{O_{i,k}} \exp\left(-q_{i,k}^{v,c} \cdot T_i\right), \quad (8)$$

where $O_{i,k} = \sum_n \mathbb{1}\{p_n = i, \; p_{n+1} \in \mathcal{T}_{i,k}\}$, $k \neq 2$, is the number
of transitions from $i$ to $k$ for the sequence $p$, and $T_i =
\sum_n \mathbb{1}\{p_n = i\} \cdot d_n$ is the time spent by $p$ in $i$. Once again, MAP
is as in (3), except with (8) in place of (2).
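The CT quantities can be sketched as below. This is an illustration under stated assumptions: each training item is a pair of equal-length lists (positions, durations in seconds), the off-diagonal rates use the usual count-over-dwell-time estimate, and the logarithm of Eq. (8) is computed for numerical stability, whereas the decision rule in the text is stated on the raw likelihood scale. All function names are hypothetical.

```python
import math
from collections import defaultdict

BACKWARD, REPEAT, DIRECT, FORWARD = 1, 2, 3, 4


def transition_type(current, nxt):
    """Map a step between positions to its aggregated type k."""
    if nxt < current:
        return BACKWARD
    if nxt == current:
        return REPEAT
    if nxt == current + 1:
        return DIRECT
    return FORWARD


def tally(positions, durations):
    """Per-sequence counts O_{i,k} (k != 2) and dwell times T_i."""
    counts, dwell = defaultdict(int), defaultdict(float)
    for n, pos in enumerate(positions):
        dwell[pos] += durations[n]
        if n + 1 < len(positions):
            k = transition_type(pos, positions[n + 1])
            if k != REPEAT:
                counts[(pos, k)] += 1
    return counts, dwell


def estimate_rates(training):
    """Pooled estimate q_{i,k} = O_{i,k} / r_i for k != 2 (assumed form of Eq. (7))."""
    counts, dwell = defaultdict(int), defaultdict(float)
    for positions, durations in training:
        seq_counts, seq_dwell = tally(positions, durations)
        for key, value in seq_counts.items():
            counts[key] += value
        for pos, value in seq_dwell.items():
            dwell[pos] += value
    return {key: n / dwell[key[0]]
            for key, n in counts.items() if dwell[key[0]] > 0}


def ct_log_likelihood(positions, durations, rates):
    """Logarithm of Eq. (8) for one sequence."""
    counts, dwell = tally(positions, durations)
    total = 0.0
    for pos, time_spent in dwell.items():
        for k in (BACKWARD, DIRECT, FORWARD):
            q = rates.get((pos, k), 0.0)
            observed = counts.get((pos, k), 0)
            if observed > 0:
                if q == 0.0:
                    return float("-inf")  # transition never seen in training
                total += observed * math.log(q)
            total -= q * time_spent
    return total
```

Positions that a sequence never visits have zero dwell time and zero counts, so they contribute a factor of one to Eq. (8) and are skipped in the loop.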

We also considered another position-based model. Contin¬ 
uous Time Positions (CP), which used the time spent at each 
position in likelihood computation. We omit it because its 
results were strictly lower than the other three models. 


V. Prediction Evaluation 

In this section, we evaluate the performance of the models
described in Section IV. We pose the following questions:

1. How beneficial is it to include positions and transitions for
CFA prediction on individual videos?

2. Is one of the position- or transition-based models clearly
better than the other, or would some combination be the best?

3. Is it beneficial to include position durations?


Skewed-Random (SKR). We will also consider an algorithm
that does not make use of clickstream data, to act as a baseline
for evaluating the gain from incorporating behavior. SKR finds
the CFA bias $g^{v,1}$ over the training set $\mathcal{U}^v_T$, and predicts $c = 1$
a fraction $g^{v,1}$ of the time (similar to the baseline used in [5]).
Note that in our application of CFA prediction for individual
videos, more sophisticated baselines that would leverage
similarities across users and/or quizzes without behavioral data
(e.g., collaborative filtering like in [16], [17]) are not applicable.

(a) ‘FMB’

Model | w_v avg | w_v s.d. | b_v avg | b_v s.d. | Acc avg | Acc s.d. | F1 avg | F1 s.d.
SKR   | -       | -        | -       | -        | 0.510   | 0.073    | 0.573  | 0.109
DP    | 176     | 116      | 4.9E-5  | 1.3E-4   | 0.569   | 0.080    | 0.645  | 0.132
DT    | 263     | 109      | 3.5E-5  | 1.0E-4   | 0.572   | 0.084    | 0.614  | 0.165
CT    | 212     | 99       | 2.1E-6  | 3.7E-6   | 0.558   | 0.085    | 0.619  | 0.162

(b) ‘NI’

Model | w_v avg | w_v s.d. | b_v avg | b_v s.d. | Acc avg | Acc s.d. | F1 avg | F1 s.d.
SKR   | -       | -        | -       | -        | 0.531   | 0.069    | 0.607  | 0.108
DP    | 75      | 35       | 3.2E-4  | 7.6E-4   | 0.589   | 0.093    | 0.654  | 0.176
DT    | 105     | 72       | 3.7E-3  | 7.8E-3   | 0.587   | 0.099    | 0.652  | 0.152
CT    | 71      | 38       | 1.6E-5  | 3.3E-5   | 0.587   | 0.097    | 0.661  | 0.165

Fig. 8: Summary of the tuned parameters window size ($w_v$) and likelihood bias ($b_v$), and of the performance metrics accuracy (Acc) and
F1, obtained across the videos for each course. avg and s.d. are calculated over the averages on the 10 evaluation sets for each video.
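The SKR baseline is simple enough to sketch directly. The snippet below is an illustration with hypothetical function names; it assumes labels are encoded as 1 for CFA and 0 for non-CFA.

```python
import random


def class_bias(train_labels):
    """Estimated CFA bias: fraction of training samples with label 1."""
    return sum(train_labels) / len(train_labels)


def skr_predict(g1, rng=random):
    """Predict CFA (1) with probability g1, ignoring all clickstream data."""
    return 1 if rng.random() < g1 else 0
```

Since the prediction never looks at the sequence, any gain of DP, DT, or CT over SKR on the same evaluation set can be attributed to the behavioral information.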

A. Procedure 

Metrics. Let TP, FP, TN, and FN be the number of true
and false positives, and true and false negatives obtained by
a model on an evaluation set. The first metric we consider
is accuracy, i.e., (TP + TN)/(TP + FP + TN + FN). Since
the quizzes are biased towards CFA (see Fig. 1), we found
that unconstrained maximization of accuracy during the tuning
procedure (described below) led to high recall (rec), i.e.,
TP/(TP + FN), but low precision (prec), i.e., TP/(TP + FP).
To avoid this, we will subject tuning to the constraint that
the chosen parameters have at least 25% of the truly negative
samples predicted negative, and likewise for the positives.
To this end, the second metric we consider is the standard
(balanced) F1 score, obtained as 2 · (prec · rec)/(prec + rec).
As the harmonic mean of precision and recall, F1 is
limited by the minimum of the two, capturing the tradeoff
between them that is induced by this constraint.
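The two metrics and the tuning constraint can be written out as a short sketch. The function names are hypothetical, labels are assumed to be 1 for CFA and 0 for non-CFA, and undefined ratios (zero denominators) are set to 0 by convention here.

```python
def confusion(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def metrics(y_true, y_pred):
    """Accuracy, precision, recall, F1, and true-negative rate."""
    tp, fp, tn, fn = confusion(y_true, y_pred)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / len(y_true)
    return {"acc": acc, "prec": prec, "rec": rec, "f1": f1, "tnr": tnr}


def satisfies_constraint(m, floor=0.25):
    """At least `floor` of each true class must be predicted correctly."""
    return m["rec"] >= floor and m["tnr"] >= floor
```

A classifier that always predicts CFA has recall 1 but a true-negative rate of 0, so it is rejected by the constraint even when the class skew makes its accuracy look acceptable.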

Even a few percent improvement in these metrics can be a
substantial gain for CFA prediction. As a reference, in KDD
Cup 2010 for CFA prediction there was only 1% improvement
from the 132nd to the best score on the leaderboard.
Training and testing. We consider $N$ evaluation iterations for
each video. In each iteration, we use the following procedure:

1) Divide the elements of $\mathcal{U}^v$ into $K$ disjoint folds
$\mathcal{U}^v_1, ..., \mathcal{U}^v_K$. In doing so, we randomly allocate samples
of CFA and non-CFA to folds, ensuring that the
number of class instances is equal across folds (e.g.,
$|\mathcal{U}^{v,0}_k| = |\mathcal{U}^{v,0}_{k'}|$ for all $k, k'$).

2) Set $\mathcal{U}^v_E = \mathcal{U}^v_K$ and $\mathcal{U}^v_T = \mathcal{U}^v \setminus \mathcal{U}^v_K$.

3) Using $\mathcal{U}^v_T$, tune the algorithm parameters $w_v$ and $b_v$
through the parameter tuning procedure described below.

4) With the tuned values, train the quantities required to
compute the likelihoods and MAP of each model over
the full $\mathcal{U}^v_T$, and evaluate on $\mathcal{U}^v_E$.

The results for each metric are averaged over the $N$ iterations.
In our evaluation, we set $N = 10$ and $K = 5$.
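The fold construction in steps 1) and 2) can be sketched as follows. This is an illustrative helper with hypothetical names; it deals the shuffled members of each class round-robin across folds, so class counts are exactly equal across folds only when each class size is a multiple of $K$, and otherwise differ by at most one.

```python
import random


def stratified_folds(labels, num_folds, rng=random):
    """Return `num_folds` lists of sample indices with classes spread evenly."""
    folds = [[] for _ in range(num_folds)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for j, idx in enumerate(members):
            folds[j % num_folds].append(idx)
    return folds


def split_last_fold(folds):
    """Use the last fold for evaluation and the remaining folds for training."""
    test = list(folds[-1])
    train = [i for fold in folds[:-1] for i in fold]
    return train, test
```

Repeating this with a fresh shuffle for each of the $N$ iterations gives a different evaluation fold each time, over which the metrics are averaged.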
Parameter tuning. Each algorithm has two parameters that
must be tuned: the video width $w_v$ and the likelihood bias $b_v$.
To do this, we apply Cross-Validation (CV)
over the $K - 1$ training set partitions. The following is
the procedure for each CV iteration $k \in \{1, ..., K - 1\}$:

1) Set $\mathcal{U}^v_C = \mathcal{U}^v_k$ and $\mathcal{U}^v_R = \mathcal{U}^v_T \setminus \mathcal{U}^v_k$.

2) Obtain the results of training on $\mathcal{U}^v_R$ and testing on $\mathcal{U}^v_C$
for each pair $(w_v, b_v) \in \{5, 10, ..., 20, 30, 45, ..., 600\} \times
\{0, 2^{-60}, 2^{-58}, ..., 1\}$, i.e., a total of 1,376 pairs.

In the end, we select the combination of parameters $(w_v, b_v)$
which yields the highest average accuracy over the CV iterations,
subject to the constraint described with the metrics. Note
that for $w_v$, we choose this set since (i) 5 sec corresponds to
the threshold of combining repeat events (see Sec. II-B1), and
(ii) 600 is close to the minimum video length in both courses.
For both parameters, these choices ensured that most selections
across videos did not lie on one of the grid endpoints.
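The size of the search grid can be checked by enumerating it. The sketch below (with hypothetical function names) reads the width set as steps of 5 sec up to 20 and steps of 15 sec from 30 to 600, and the bias set as 0 together with the even powers of two from $2^{-60}$ to $2^0 = 1$.

```python
def width_grid():
    """Candidate window sizes w_v in seconds: 5..20 by 5, then 30..600 by 15."""
    return list(range(5, 21, 5)) + list(range(30, 601, 15))


def bias_grid():
    """Candidate likelihood biases b_v: 0 and 2^-60, 2^-58, ..., 2^0."""
    return [0.0] + [2.0 ** e for e in range(-60, 1, 2)]


def parameter_grid():
    """All (w_v, b_v) pairs evaluated in each cross-validation iteration."""
    return [(w, b) for w in width_grid() for b in bias_grid()]
```

This gives 43 widths and 32 biases, i.e., 43 × 32 = 1,376 pairs, matching the count stated above.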


B. Results and Discussion 


Since there is a sharp dropoff in quiz participation over
time, we only consider those videos for which there are at least
100 samples of both CFA and non-CFA instances, so that there are
at least 20 samples from each group in each of the five folds.
This leaves a total of 24 videos for ‘FMB’ and 32 for ‘NI’.
Overview of results. Summary information on the tuned $w_v$
and $b_v$ values, as well as the two performance metrics -
Accuracy (Acc) and F1 - can be found for each course in
Fig. 8. Here, we give the average (avg) and standard deviation
(s.d.) of these values across videos. The distribution of the
performance values are plotted for each course in Fig. 9; in
each box, the performance on one video is one data point.

From Fig. 9, we can see immediately that the DP, DT, and
CT algorithms perform substantially better than SKR overall.
Further, the improvement is higher for accuracy than for F1,
which is expected since the tuning monitors accuracy. In order
to test for significance in the performance differences between
each pair of models, we run a WRS test (as in Sec. II-B) for
the null hypothesis that there is no difference between the
distributions in Fig. 9. The resulting p-values (p) from these
tests are tabulated in Fig. 10, and verify the differences.

Finally, in Fig. 11, we plot the percent increase in performance
for each of the algorithms relative to SKR on each of
the videos, for a more specific case-by-case comparison.

1: Benefit of clickstream data. We assess how beneficial 
the position and transition information is for prediction by 
comparing each of the algorithms to SKR. 

Accuracy. Considering accuracy first, refer to Fig. 9(a&c).
Here, we see that the DP, DT, and CT models are clearly
shifted to the right relative to SKR, indicating higher quality.
For ‘FMB’, the shift in the mean of DP and DT relative to SKR
is roughly 12%, and of CT is 9%; for ‘NI’, the improvements
are roughly 11% for each of the algorithms. From Fig. 10, we
see that this difference is also statistically significant for each
algorithm across both courses, with p < 0.02 in each case.

As for individual videos in Fig. 11(a&c), we see that each
algorithm outperforms SKR in the vast majority of cases,
across both datasets. The fraction of times in which DP, DT,
and CT outperform SKR in ‘FMB’ (‘NI’) is 100% (97%), 96%
(88%), and 92% (91%), respectively.
[Figure 9 panels: (a) ‘FMB’, accuracy; (b) ‘FMB’, F1; (c) ‘NI’, accuracy; (d) ‘NI’, F1.]

Fig. 9: Boxplots of CFA prediction performance across both courses, considering accuracy and F1. Here, each datapoint is the obtained
performance on one of the videos considered. Overall, we see that DP, DT, and CT outperform SKR for both metrics, and especially for
accuracy, while CP performs comparable to SKR.


(a) ‘FMB’, accuracy

    | SKR      | DP       | DT       | CT
SKR | -        | 2.5E-3** | 2.2E-3** | 0.018*
DP  | 2.5E-3** | -        | 0.75     | 0.72
DT  | 2.2E-3** | 0.75     | -        | 0.28
CT  | 0.018*   | 0.72     | 0.28     | -

(b) ‘FMB’, F1

    | SKR      | DP       | DT       | CT
SKR | -        | 0.014*   | 0.16     | 0.065
DP  | 0.014*   | -        | 0.85     | 0.77
DT  | 0.16     | 0.85     | -        | 0.98
CT  | 0.065    | 0.77     | 0.98     | -

(c) ‘NI’, accuracy

    | SKR         | DP          | DT       | CT
SKR | -           | [illegible] | 0.019*   | 0.015*
DP  | [illegible] | -           | 0.91     | 0.79
DT  | 0.019*      | 0.91        | -        | 0.94
CT  | 0.015*      | 0.79        | 0.94     | -

(d) ‘NI’, F1

    | SKR      | DP       | DT       | CT
SKR | -        | 0.012*   | 0.045*   | 6.3E-3**
DP  | 0.012*   | -        | 0.90     | 0.99
DT  | 0.045*   | 0.90     | -        | 0.86
CT  | 6.3E-3** | 0.99     | 0.86     | -

Fig. 10: p-values (p) from applying pairwise WRS tests to the
distributions in Fig. 9. A * indicates significance at p < 0.05, and ** at p < 0.01.


F1 score: For F1, refer to Fig. 9(b&d). Again, we see that DP, 
DT, and CT are shifted to the right relative to SKR overall, 
but not as substantially. This is especially true for DT, which 
has the highest range of F1 scores. For DP, the increases in 
mean performance of roughly 13% for ‘FMB’ and 8% for 
‘NI’ are both significant, with p < 0.02 from Fig. 10. Both 
DT and CT have significant improvements of 7% and 9% in 
‘NI’ (p < 0.05); the improvements of 7% and 8% in ‘FMB’ 
are also substantial, but not significant (p > 0.06). The fraction 
of videos in which each algorithm outperforms SKR is also 
lower than for accuracy; for DP, DT, and CT, these fractions 
for ‘FMB’ (‘NI’) are 92% (84%), 79% (81%), and 88% (91%). 
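
The two summary quantities used in this comparison, the fraction of videos on which an algorithm beats SKR and the improvement in mean performance, can be computed as in the sketch below. It assumes that the improvement is measured relative to the baseline mean; the function name `compare_to_baseline` and the scores are hypothetical and serve only as an example.

```python
def compare_to_baseline(algo_scores, base_scores):
    """Return (win rate, relative gain in mean score) of an algorithm over a baseline.

    Both lists hold one score per video, in the same video order.
    """
    if len(algo_scores) != len(base_scores) or not algo_scores:
        raise ValueError("need two non-empty score lists of equal length")
    # Fraction of videos where the algorithm strictly beats the baseline.
    wins = sum(a > b for a, b in zip(algo_scores, base_scores))
    win_rate = wins / len(algo_scores)
    # Assumption: improvement is expressed relative to the baseline mean.
    mean_algo = sum(algo_scores) / len(algo_scores)
    mean_base = sum(base_scores) / len(base_scores)
    rel_gain = (mean_algo - mean_base) / mean_base
    return win_rate, rel_gain


# Hypothetical F1 scores on four videos (illustrative only).
base = [0.50, 0.60, 0.40, 0.50]
algo = [0.60, 0.66, 0.38, 0.56]
win_rate, rel_gain = compare_to_baseline(algo, base)
print(f"outperforms baseline on {win_rate:.0%} of videos, mean gain {rel_gain:.0%}")
```

The two quantities can disagree: an algorithm may win on most videos by small margins yet lose heavily on a few, which is the pattern of the F1 outliers discussed in the text.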

In Fig. 11(b&d), we remark that there are a total of six 
videos (indexes 5, 16, and 24 in ‘FMB’, and 3, 27, and 31 in 
‘NI’) where most of the DP, DT, and CT algorithms perform 
substantially worse than SKR in F1 score. These videos also 
correspond to the outliers observed below the first quartiles in 
Fig. 9(b&d). One would expect these to be instances 
where SKR already had high performance due to a high bias 
(skew) in favor of either CFA or non-CFA (e.g., a video with 
an easy or a hard quiz). Surprisingly, the opposite is true: the 
F1 and accuracy scores obtained by SKR on these six videos 
are all within the bottom nine of all videos. There is also no 
consistency among the CFA biases (half are above 0.5, half are 
below). Further, there are other videos with biases in the same 
ranges where the algorithms outperform SKR substantially. 
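
To make the expectation about bias concrete, the sketch below shows how a skewed CFA split inflates the accuracy and F1 of a trivial predictor. The majority-class predictor here is a hypothetical stand-in used only for illustration; it is not a definition of SKR, and the label lists are made up.

```python
def accuracy_f1(y_true, y_pred, positive=1):
    """Accuracy and F1 score (with respect to the positive class) for binary labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return acc, f1


def majority_predictor(y_true):
    """Hypothetical trivial predictor: always output the more frequent label."""
    label = 1 if 2 * sum(y_true) >= len(y_true) else 0
    return [label] * len(y_true)


# 1 = CFA, 0 = non-CFA. A skewed video (90% CFA) versus a balanced one (50% CFA).
skewed = [1] * 9 + [0]
balanced = [1] * 5 + [0] * 5

acc_skew, f1_skew = accuracy_f1(skewed, majority_predictor(skewed))
acc_bal, f1_bal = accuracy_f1(balanced, majority_predictor(balanced))
print(f"skewed:   accuracy={acc_skew:.2f}, F1={f1_skew:.2f}")
print(f"balanced: accuracy={acc_bal:.2f}, F1={f1_bal:.2f}")
```

Under this stand-in, a strongly biased video leaves little room for a behavior-based model to improve, which is the situation one would have expected for the six outlier videos but which the reported SKR scores do not support.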

2: Positions vs. transitions. For this, we compare DP to DT. 
In terms of accuracy, in Fig. 9(a&c) we see that the algorithms 
are comparable for both courses on average. As for F1 in 
Fig. 9(b&d), DP has modestly better average performance, 
especially for ‘FMB’ where it has an improvement of roughly 
5%. DT has a higher range in each case (excluding outliers), 
with generally lower performance than DP below quartile Q2 
(e.g., in F1 for ‘FMB’) but, in accuracy for ‘FMB’, also higher 
performance above Q2. When considering individual videos in 
Fig. 11, DT and DP each perform better in roughly 50% of the 
cases, with the exception of accuracy in ‘FMB’, for which DT 
is higher the majority of the time. Overall, the differences 
between DT and DP are not statistically significant for either 
course or metric, with p > 0.75 in all cases in Fig. 10. 

3: Discrete vs. continuous. Finally, we compare DT to CT. 
In terms of accuracy in Fig. 9, DT is shifted to the right by 
roughly 3% relative to CT for ‘FMB’, whereas for ‘NI’ 
the algorithms are comparable. As to F1 score, while DT and 
CT are comparable overall, the distribution for CT is slightly 
shifted to the right for both courses. Considering individual 
videos, DT outperforms CT on more videos in each of the four 
cases in Fig. 11. In particular, for accuracy (F1), it outperforms 
CT in 75% (58%) of the cases for ‘FMB’, and 69% (63%) for ‘NI’. 
Still, overall, the differences are not statistically significant for 
either course or metric, with p > 0.28 in all cases. 

Key messages. Many aspects of position-based video behavior 
are useful for CFA prediction: the frequency of visits to each 
position (DP), the frequency of transitions between positions 
(DT), and transitions incorporating holding times (CT). These 
benefits are also measured on individual videos, which underscores 
the applicability of these models to situations where 
there is not a lot of information across multiple lectures, 
e.g., for quick detection early in a course. Both positions and 
transitions can be useful; DP, DT, and CT are comparable 
overall, performing better on different sets of videos. 
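
The three kinds of behavioral quantities named above can be sketched as follows. This is an illustration under stated assumptions, not the models' exact definitions: positions are taken to be integer indices of video segments, holding times are seconds spent at each visited position, and the normalizations are one plausible choice. All function names and the example clickstream are hypothetical.

```python
from collections import Counter


def position_frequencies(positions, n_positions):
    """Fraction of visits that fall on each position (DP-style quantity)."""
    counts = Counter(positions)
    return [counts[i] / len(positions) for i in range(n_positions)]


def transition_frequencies(positions, n_positions):
    """Fraction of consecutive moves going from position a to b (DT-style quantity)."""
    matrix = [[0.0] * n_positions for _ in range(n_positions)]
    pairs = list(zip(positions, positions[1:]))
    for a, b in pairs:
        matrix[a][b] += 1
    if pairs:
        matrix = [[c / len(pairs) for c in row] for row in matrix]
    return matrix


def mean_holding_times(positions, hold_secs, n_positions):
    """Average seconds spent per visit at each position (input to a CT-style model)."""
    totals = [0.0] * n_positions
    visits = [0] * n_positions
    for pos, secs in zip(positions, hold_secs):
        totals[pos] += secs
        visits[pos] += 1
    return [t / v if v else 0.0 for t, v in zip(totals, visits)]


# Hypothetical user: watches segments 0, 1, 2, skips back to 1, then continues to 3.
positions = [0, 1, 2, 1, 2, 3]
hold_secs = [10, 10, 4, 10, 10, 10]

dp = position_frequencies(positions, 4)
dt = transition_frequencies(positions, 4)
ht = mean_holding_times(positions, hold_secs, 4)
print("visit frequencies:", dp)
print("P(1 -> 2) =", dt[1][2], " P(2 -> 1) =", dt[2][1])
print("mean holding times:", ht)
```

In this example the skip back shows up as a nonzero entry below the diagonal of the transition matrix, information that the visit frequencies alone do not carry.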

Each of the algorithms tested here employs feature spaces 
that represent user behavior directly; namely, positions 
visited and transitions. Higher quality predictions may be 
attainable by passing these through more complicated machine 
learning algorithms (e.g., kernel SVM) to learn over higher 
dimensional spaces. A significant advantage of our feature 
spaces, though, is their natural interpretation in terms of 
learner actions, which can be related to CFA results. An 
interesting avenue of future work would be to use the position and 
transition matrices inferred over the CFA classes to generate 
recommendations guiding learner behavior in real time. 

Fig. 11: Percent improvements of each algorithm relative to SKR for individual videos, in each course for each metric. Consistent with Fig. 
9, we see that each algorithm outperforms SKR in the vast majority of cases, except for six videos with respect to F1. 

VI. Related Work 

We discuss recent, key works on MOOCs, student 
video-watching analysis, and CFA prediction. 

MOOC studies. With the proliferation of MOOCs in recent 
years, there have been a number of analytical studies on these 
platforms. Some have focused on a more general analysis of 
all learning modes, e.g., [19] studied learner engagement 
variation over time and across courses. Others have focused 
on specific modes, e.g., in terms of forums, [3] analyzed 
the decline in participation over 73 courses. Our work is 
fundamentally different from these works in that it explores 
the association between behavior in two modes: video and 
assessment. 

Video-watching analysis. Most existing work on learner 
video-watching behavior [5], [9], [20] has focused on 
session-level user characteristics (e.g., rewatching sessions), rather 
than click-level information. The work in [7] is most similar 
to ours, since it is also concerned with recurring patterns in 
clickstream sequences for MOOC users. The authors define a 
mapping of subsequences of events to predefined behavioral 
actions (e.g., skipping, slow watching) and perform approximate 
string search to locate these behaviors in clickstreams. 
Our work on motif identification differs in two important ways: 
(i) rather than assuming a predefined set of actions, we extract 
the recurring sequences directly using motif identification 
algorithms, and (ii) we are concerned with mapping motifs 
to efficacy, in contrast to [7] where the objective is to predict 
engagement, next click, and dropout. 

Performance prediction. Researchers have developed 
predictors for whether a student will be CFA or not on a 
question in traditional education settings. Collaborative 
filtering algorithms have been applied as classification models 
for this purpose (e.g., [16], [17]). Others have used probabilistic 
graphical models (PGMs) [?] when there is coarse-grained 
information collected (e.g., course difficulty) over multiple 
sessions. Recently, [?] developed SPARFA-Trace, which traces a 
learner’s knowledge through the sequence of material accessed 
and questions answered. Compared with these works, ours is 
unique in that (i) it focuses on relating click-level data 
(video-watching behavior) to performance, and (ii) it focuses on 
prediction within single videos. The recent work of [5] studied the 
predictive capability of session-level video-watching quantities 
computed from clickstream data (e.g., the fraction of the video 
watched and the number of rewinds), considering multiple 
users and videos in the course simultaneously. Focusing on 
individual videos, our models are instead position-dependent, 
and the improvements in accuracy relative to the baseline that 
we obtain are strictly higher than those reported there (a 3% 
increase over the same baseline). Overall, we emphasize that the models 
used in each of these other works are not readily applicable to 
our setting, because we focus on the case of individual videos 
where similarities among users/quizzes are not available. 

Webpage clickstream analysis. Webpage clickstream analysis 
[?], [?], [?] remains an active area of research. 
Video-watching clickstreams are fundamentally different from these 
applications, which concern transitions between webpages 
rather than behavior within a single window. 

VII. Conclusion and Future Work 

In this work, we studied student video-watching behavior, 
performance, and their association in MOOCs. In doing so, 
we formalized two frameworks for representing user 
clickstreams: one based on sequences of events with discretized 
lengths, and one based on sequences of positions visited. 
With datasets from two MOOCs encoded in these frameworks, 
we accomplished two goals: (i) we mined the sequences to 
identify recurring motifs in user behavior, and discovered 
that some of these characteristics are significantly associated 
with CFA and non-CFA quiz submissions; (ii) we proposed 
models for relating user clickstreams to knowledge gained, 
and showed how multiple aspects of this behavior can improve 
CFA prediction quality on individual videos. 

There are a number of next steps we are investigating, e.g., 
to use the identified motifs for user and content analytics; 
to optimize the selection of quantiles used to divide the event 
lengths; to consider position transition durations under a 
non-exponential assumption; and to see whether prediction 
improvement can be obtained through higher order transitions. 